SQL Server Outage Explanation

This is the information we have so far on the SQL Server outage of 10-9 through 10-10. Microsoft's technical team and the SQL Server development team have been working with us since early this morning to figure out what happened, what can be done to prevent it in the future, and whether a fix is possible on their end.

At around 3PM MDT we received a very small handful of complaints about connection issues to SQL Server. We investigated but found no issues and were unable to reproduce any problems. In hindsight, this is when the first issue started. Just after 7:30PM MDT the entire SQL Server service and process failed and closed without shutting down cleanly. This immediately caused issues with a large number of databases: about 130 went into “Suspect” status, roughly 50 went into “Emergency” status, and about 80 went into “Recovery” status. We were able to restart the SQL Server service, but it was writing a large number of errors (about 100 per second) to the event logs.
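For those who want to check their own databases, the state SQL Server reports for each database can be seen directly from the catalog views. The query below is a general example, not our exact monitoring script:

    -- Example only: list each database and the state SQL Server reports for it
    -- (ONLINE, SUSPECT, EMERGENCY, RECOVERING, RECOVERY_PENDING, etc.)
    SELECT name,
           state_desc,       -- current database state
           user_access_desc  -- MULTI_USER / SINGLE_USER / RESTRICTED_USER
    FROM sys.databases
    ORDER BY state_desc, name;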

We were able to recover most of the “Suspect” databases, but none of the “Emergency” databases. The master database was also throwing a large number of errors and showing other issues. At this point we zipped up the logs directory to hand over to Microsoft and proceeded to revert to backups. The SQL Server Windows installation and all databases are backed up (bare metal) every 3 hours. We first restored the 6PM MDT backup; however, upon starting it, we noticed the same issue as before. It appeared that the corruption had been occurring before SQL Server crashed. We then tried the 3PM backup and unfortunately found the same issues on that image as well, though to a lesser degree. We ultimately had to fall back to the 12PM MDT backup, which had no issues or errors and started immediately.
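For context, the usual last-resort path for bringing a “Suspect” database back online when no clean backup is available looks roughly like the sketch below. This is a general illustration with a placeholder database name, not the exact commands we ran; because REPAIR_ALLOW_DATA_LOSS can discard data, we preferred restoring from backup instead:

    -- General sketch only; [ExampleDB] is a placeholder name
    ALTER DATABASE [ExampleDB] SET EMERGENCY;   -- make the suspect database accessible
    ALTER DATABASE [ExampleDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB (N'ExampleDB', REPAIR_ALLOW_DATA_LOSS);  -- last resort, may lose data
    ALTER DATABASE [ExampleDB] SET MULTI_USER;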

Microsoft responded to us early this morning stating that they were looking into it, but that due to the number and size of the logs and crash dumps it might take a few hours. Today, around 12PM MDT, they got back to us with a fairly “simple” explanation of what they believe happened.

At around 3PM MDT the thread count of SQL Server began climbing slowly but at a steady rate; by 7PM MDT the number of open threads was in the tens of thousands. At that point the Windows CPU scheduler / threading manager gave up, the OS locked up, and SQL Server crashed because it had no resources left to do anything further. Once the service crashed, Windows recovered enough for us to cleanly reboot the server. At this time Microsoft is not sure what caused the thread growth: whether it is a CPU problem, a Windows problem, something internal to SQL Server, a BIOS issue, something in the processor’s microcode, etc. They are continuing the investigation later this week with the Windows support and development teams to see what may have happened. As they learn more about what could have caused this issue, we will update this post.
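For those curious how this kind of growth shows up from inside SQL Server, thread and worker counts can be watched with queries like the ones below. These are illustrative examples; our actual alerting runs through PRTG against OS-level counters:

    -- Example: total threads owned by the SQL Server process, as SQLOS sees them
    SELECT COUNT(*) AS sqlos_thread_count
    FROM sys.dm_os_threads;

    -- Example: tasks and workers per scheduler
    SELECT scheduler_id, current_tasks_count, current_workers_count, work_queue_count
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';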

SQL Server is currently fully operational. All databases passed DBCC CHECKDB without errors. We are now monitoring SQL Server’s process handles and threads much more closely. With guidance from Microsoft, and using their recommended limits for our hardware, we have fine-tuned our PRTG monitoring system to better alert us to potential problems.
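For reference, the integrity checks were standard DBCC CHECKDB runs. The sketch below shows one common way to sweep every online database; it is illustrative only and not our exact maintenance script:

    -- Example sweep: run DBCC CHECKDB against every ONLINE database
    DECLARE @db sysname, @sql nvarchar(max);
    DECLARE db_cursor CURSOR FOR
        SELECT name FROM sys.databases WHERE state_desc = 'ONLINE';
    OPEN db_cursor;
    FETCH NEXT FROM db_cursor INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db) + N') WITH NO_INFOMSGS, ALL_ERRORMSGS;';
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM db_cursor INTO @db;
    END
    CLOSE db_cursor;
    DEALLOCATE db_cursor;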

We’d like to thank everyone for their patience and understanding while we worked to resolve this.

Thank you for choosing ASPnix!