Build storage issues - resolved!

This is a very technical and detailed debrief. For those who want the short version - it’s fixed :-) Everyone else, read on.

As many of you already know, we had some pretty serious issues over the past weeks with the storage system that supports the build/unit test environment. We have resolved the issues and wanted to give everyone a rundown of the issues we found, what we have done to resolve them and what open tasks are left.

The issue manifested itself in a few ways. We saw slow transfers, SCSI aborts, reservation failures and VM guest-level corruption. This started as a very rare occurrence and over time became more and more frequent, to the point that we could not keep even a small number of I/O-intensive VMs up for one hour and had trouble getting them off. We started troubleshooting the issue a few weeks ago, and finally came to a total resolution early this week. Here is a summary of the issues, how they came to be and how we resolved them:

* http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=226424 (A filer may exhibit poor performance due to WAFL holding on to too many network buffers and not releasing them in a timely fashion.)
To fix this, we had to upgrade to 7.2.4 - that has been completed.

* NetApp LUNs were created with the wrong LUN type.
This was caused by an error in the LUN creation workflow, which left the LUN type at the default value (Solaris). 3 out of 4 LUNs were of type Solaris, so blocks were not written efficiently to disk (the 4k VMware blocks were written offset from the true disk geometry). Reading from these LUNs triggered many read-aheads and at times overwhelmed the filer because of the inefficient on-disk layout. To remedy this, we migrated the data off, re-created all of the LUNs and migrated the data back.

* NetApp igroups were set to the wrong type.
Initially NetApp advised that the Linux igroup type (the igroup is what maps a LUN to various hosts) was OK for use with VMware. This was incorrect and caused improper SCSI reservations and iSCSI timeouts. NetApp is updating their internal documentation to reflect this change.

* Network setup issues
The initial guidance from NetApp was to set up the network in a specific configuration (one link to each upstream switch, with a virtual interface bonding them). After further investigation, I found this is *not* the best practice and was in fact causing issues with dead HBA paths. To correct this temporarily, we disabled one of the links, leaving single uplinks (still with redundant heads).
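
For anyone curious why the wrong LUN type hurt so much, here is a rough Python sketch of the alignment arithmetic. It is only an illustration, not a model of what the filer actually does: the 4096-byte block size matches the 4k blocks mentioned above, but the 512-byte offset and the function names are made up for the example, and I am not claiming that was the real offset on our LUNs.

```python
BLOCK_SIZE = 4096  # assumed size of both guest and storage blocks, in bytes


def backing_blocks_touched(guest_block: int, offset: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """Count the storage blocks covered by a single guest block.

    `offset` is a hypothetical byte shift between where the guest
    believes block 0 starts and where the storage's block 0 starts.
    """
    start = guest_block * block_size + offset
    end = start + block_size - 1  # last byte of the guest block
    return end // block_size - start // block_size + 1


def total_backing_ios(num_guest_blocks: int, offset: int) -> int:
    """Sum the storage blocks touched when each guest block is accessed on its own."""
    return sum(backing_blocks_touched(b, offset) for b in range(num_guest_blocks))


aligned = total_backing_ios(1000, offset=0)
misaligned = total_backing_ios(1000, offset=512)  # illustrative offset only
print(aligned, misaligned)
```

Under these assumptions every misaligned guest block straddles two storage blocks, so the same workload costs roughly twice the back-end block accesses. That is the general shape of the extra load we were putting on the filer.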

All of these issues created major performance degradation and block level access/corruption problems. They have all been resolved at this point. We still need to adjust the network interfaces to be more redundant.

Special thanks to the release engineering team, who have been *incredibly* patient with us as we worked through this. I know how frustrating it was, and they kept a smile (well, kind of) through the situation, which really helped us keep pushing forward to a solution. Thanks also to mrz for the amazing amount of work he put into this…very dedicated to finding a solution no matter what time it was.
