Justin’s Blog
Mozilla IT/Operations…in brief
Network outage report - 3/18/08, 8:01pm PDT - 9:25 pm PDT
March 20th, 2008 by justin
We had a network outage at our San Jose datacenter from 8:01 pm PDT until 9:25 pm PDT on March 18. Initial investigation suggests that one of the switches in a blade server chassis had a software issue that caused a network-wide broadcast storm. As a result, the switching fabric for our San Jose datacenter was unusable.
To mitigate this issue going forward, we have made two changes.
- Modified the port-channels connecting the core switches to downstream switches to better handle a port-channel member failure.
- Further tuned broadcast storm protection on every switch port to limit the amount of broadcast and multicast traffic any one device is allowed to send.
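For readers unfamiliar with broadcast storm protection, the idea behind the second change can be sketched roughly as below. This is only an illustration in Python, not our switch configuration or the vendor's implementation: real switches enforce this in hardware, and the `StormControl` class, its fixed one-second window, and the frame threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StormControl:
    """Hypothetical per-port limiter for broadcast/multicast frames."""

    max_frames_per_interval: int      # assumed threshold, not a real setting
    interval_seconds: float = 1.0     # assumed fixed measurement window
    _window_start: float = 0.0
    _count: int = 0

    def allow(self, now: float, is_broadcast_or_multicast: bool) -> bool:
        """Return True if the frame should be forwarded, False if dropped."""
        if not is_broadcast_or_multicast:
            return True  # unicast traffic is not limited in this sketch
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self.interval_seconds:
            self._window_start = now
            self._count = 0
        self._count += 1
        return self._count <= self.max_frames_per_interval


def simulate(limit: int, timestamps: list[float]) -> list[bool]:
    """Feed broadcast frames arriving at the given times through one port."""
    port = StormControl(max_frames_per_interval=limit)
    return [port.allow(t, True) for t in timestamps]
```

With a limit of two frames per window, a third broadcast frame arriving inside the same second is dropped, while the device can send again once the next window begins. A misbehaving device is therefore capped at its own port instead of flooding the whole fabric.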
Furthermore, we have a priority case open with the vendor to determine the cause of the issue, as we did capture debug logs. The outage was in no way related to the scheduled downtime we were in; the two just happened to coincide. We apologize for any inconvenience this may have caused. We’ll continue to follow up with the vendor to make sure this issue does not happen again.