
ASPnix Official Blog

  • SQL Server Outage Explanation

    No comments · October 10th, 2016

    This is the information we have so far about the SQL Server outage from 10-9 through 10-10. Microsoft’s technical team and SQL Server development team have worked with us since early this morning to figure out what happened, what can be done to prevent it in the future, and whether a fix from their side is possible.

    At around 3PM MDT we received a very small handful of complaints about connection issues with SQL Server; we investigated but found no issues and were unable to reproduce any problems. This is when the first issue started. Just after 7:30PM MDT the entire SQL Server service and process failed and closed without shutting down cleanly. This immediately caused issues with about 130 databases: these went into “Suspect” status, while about 50 or so went into “Emergency” status and about 80 went into “Recovery” status. We were able to restart the SQL Server service, but it was writing a large number of errors (about 100 per second) to the event logs.

    We were able to recover most of the “Suspect” databases, but none of the “Emergency” databases. The master database was also producing numerous errors and other issues. At this point, we zipped up the logs directory to give to Microsoft and proceeded to revert to backups. The SQL Server Windows installation and all databases are backed up (bare metal) every 3 hours. We first restored the 6PM MDT backup; however, upon starting it, we noticed the same issue as earlier. It appeared that the corruption had begun before SQL Server crashed. We proceeded with the 3PM backup; unfortunately, we found the same issues on this image as well, though to a lesser extent. We ultimately had to fall back to the 12PM MDT backup, which had no issues or errors and started immediately.
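    The integrity checks mentioned here are SQL Server’s DBCC CHECKDB. As a rough, hypothetical illustration of how a restored image can be verified (the connection string, credentials and server name below are placeholders, not our actual tooling), a check over every user database might look like this in Python with pyodbc:

    ```python
    # Hypothetical sketch: verify restored databases with DBCC CHECKDB via pyodbc.
    # Connection string, credentials, and server name are placeholders, not
    # ASPnix's actual configuration.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mssql01.example.local;UID=sa;PWD=<password>"
    )

    def check_all_databases() -> None:
        conn = pyodbc.connect(CONN_STR, autocommit=True)
        cur = conn.cursor()
        # Enumerate user databases that are ONLINE (database_id > 4 skips
        # master, tempdb, model and msdb).
        cur.execute(
            "SELECT name FROM sys.databases "
            "WHERE database_id > 4 AND state_desc = 'ONLINE'"
        )
        names = [row.name for row in cur.fetchall()]
        for name in names:
            try:
                # DBCC CHECKDB raises errors when it finds corruption;
                # NO_INFOMSGS suppresses the informational output.
                cur.execute(f"DBCC CHECKDB ([{name}]) WITH NO_INFOMSGS")
                while cur.nextset():
                    pass
                print(f"{name}: OK")
            except pyodbc.Error as exc:
                print(f"{name}: CHECKDB reported problems -> {exc}")
        conn.close()

    if __name__ == "__main__":
        check_all_databases()
    ```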

    Microsoft responded to us early this morning stating they were looking into it; however, due to the number and size of the logs and crash dumps, it could take a few hours. Today around 12PM MDT they got back to us with a fairly “simple” explanation of what they believe happened.

    At around 3PM MDT SQL Server’s thread count began climbing slowly but at a steady rate; by 7PM MDT the number of open threads was in the tens of thousands. At that point the Windows CPU scheduler / threading manager gave up, the OS locked up, and SQL Server crashed because it had no resources left to do anything further. Once the service crashed, Windows recovered enough for us to cleanly reboot the server. At this time Microsoft is not sure what caused this: whether it is a CPU problem, a Windows problem, something internal to SQL Server, a BIOS issue, something in the processor’s microcode, etc. They are continuing the investigation later this week with the Windows support / development teams to see what may have happened. As they get more information about what could have caused this issue we will update this post.

    The current state of SQL Server is 100%. All databases passed DBCC CHECKDB without errors. We are now monitoring SQL Server’s process handles and threads more closely, and with guidance from Microsoft and their recommended limits for our hardware we have fine-tuned our PRTG monitoring system to better alert us to possible problems.
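    For context, the thread count that triggered this incident can be read directly from SQL Server’s DMVs. The sketch below is illustrative only (the threshold, polling interval and connection details are assumptions, and it is not our PRTG sensor configuration), but it shows the kind of check now being performed:

    ```python
    # Hypothetical sketch: poll SQL Server's thread count and warn above a
    # threshold. Threshold, interval and connection details are illustrative only.
    import time
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mssql01.example.local;Trusted_Connection=yes"
    )
    THREAD_LIMIT = 2000   # assumed limit; the real value depends on the hardware
    POLL_SECONDS = 60

    def current_thread_count(cur) -> int:
        # sys.dm_os_threads lists the threads owned by the SQL Server process.
        cur.execute("SELECT COUNT(*) FROM sys.dm_os_threads")
        return cur.fetchone()[0]

    def monitor() -> None:
        conn = pyodbc.connect(CONN_STR, autocommit=True)
        cur = conn.cursor()
        while True:
            count = current_thread_count(cur)
            if count > THREAD_LIMIT:
                print(f"WARNING: {count} threads open (limit {THREAD_LIMIT})")
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        monitor()
    ```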

    We’d like to thank everyone for their patience and understanding while we worked to resolve this.

    Thank you for choosing ASPnix! 

  • SQL Server Outage

    4 comments · October 10th, 2016

    We are currently working to restore SQL Server from backups. Unfortunately, today at around 7PM MDT our SQL Server encountered a fatal error that caused the service to crash. The crash caused corruption in several databases. After getting the service back online, we noted that many databases appeared to be OK but were failing DBCC checks. We are currently restoring from a bare metal backup, which will take several hours to complete. The recovery point is from 1PM MDT on 10-9-2016.

    We’d like to thank everyone for their patience during this time.

    EDIT: Server is now online. All databases are now passing DBCC checks without errors.

  • MX01 SmarterMail – Emergency Maintenance

    No comments · July 8th, 2016

    We will be performing emergency maintenance on the MX01 SmarterMail server today at 6PM MDT (12:00AM UTC) in order to replace the failed power distributor and return the server to 100% working order. We expect this to take about an hour. During this time email will not be able to be sent or received.

    Once the power distributor is replaced, the server will return to full performance; we have had to turn down the RAID’s performance and throttle the CPU to keep power usage under control, because the power supply the server is currently running on is undersized.

    Thank you for your patience during this time.

  • MX01 SmarterMail – Outage Report for 06/29/2016

    No comments · June 29th, 2016

    Incident Report

    At 6:30PM UTC (06/29/2016) our primary SmarterMail server experienced a full outage that interrupted all incoming and outgoing mail.

    Timeline (times are UTC)

    • 6:30PM Outage occurred
    • 6:33PM Technician dispatched to verify local equipment
    • 6:45PM Technician reported that the server was unable to power on
    • 6:45PM Technician replaced both power supplies and reported that the server still would not power on
    • 7:20PM Technician replaced the motherboard and reported that the server still would not power on
    • 7:30PM A replacement standard ATX power supply was ordered and scheduled for pickup
    • 8:45PM Technician installed the power supply and reported that the server successfully powered on
    • 9:00PM Server online and operational

    Root Cause

    The primary SmarterMail server (MX01) uses dual power supplies for redundancy in case of a single PSU failure. After contacting Supermicro, we were told that the internal power distribution unit the dual power supplies plug into had failed, leaving the server with no power. This component is not stocked at our data-center, cannot be picked up locally, and must be ordered directly from Supermicro.

    The server is currently operational, but using a standard ATX power supply. Once we’ve received the replacement part from Supermicro, maintenance will be scheduled to replace it so that the server can be restored to full working order.

    We’d like to thank everyone for their patience and understanding during this time and for choosing ASPnix!

  • MX01 SmarterMail Outage (Resolved)

    No comments · June 29th, 2016

    We are currently investigating an outage of our primary SmarterMail server. At this time the server is unavailable. Incoming email will be captured by our backup gateways and will be delivered once the primary server resumes operations.

    We will have updates and an ETA for the restoration of services shortly.

    Thank you for your patience during this time.

  • Denver, Colorado – Level 3 Outage Report for 06/01/2016

    No comments · June 1st, 2016

    Incident Report

    At 3:31PM UTC (06/01/2016) our Level 3 Data Center in Denver experienced a network outage that resulted in the full loss of service for nearly all services.

    Timeline (times are UTC)

    • 3:31PM Outage occurred
    • 3:40PM Technician dispatched to verify local equipment
    • 3:48PM Technician verified local equipment functionality
    • 3:50PM Ticket opened with Level 3
    • 4:05PM Ticket escalated to tier 1 support
    • 4:22PM Ticket updated that Level 3 was investigating the cause of the outage
    • 4:37PM Ticket updated to notify that equipment had been verified and that the fiber connections were good
    • 4:42PM Ticket escalated to tier 2 support
    • 5:20PM Ticket updated to notify that the connection was being reset / rebuilt
    • 5:40PM Ticket updated to notify that our services should be back within 1 hour
    • 6:38PM Ticket updated by ASPnix to notify that network was still offline
    • 7:05PM Ticket escalated to tier 3 support
    • 8:48PM Ticket updated to notify that the issue had been resolved and our services would be restored within minutes
    • 8:51PM BGP connection reestablished on both ASPnix routers and service restored

    Root Cause

    ASPnix uses 2 private AS numbers provided by Level 3 to allow BGP routing between our routers and the Level 3 edge routers that service our circuits. This allows for failover between routers in the unfortunate event that one fails. Due to an outstanding order tied to our account, their system automatically closed the order and marked our AS numbers as inactive. Within 5 – 7 minutes their routers received the updated configuration, marked those routes as inactive, and removed them, leaving all traffic to and from our network unroutable.

    Because the AS numbers were missing from our account, the tier 2 support technician assumed we did not use BGP and rebuilt our connection without it. Our routers are configured for BGP and advertise over BGP; as a result, they were unable to establish a session and advertise our IP routes to the edge router.

    Tier 3 support was able to see that the rebuilt configuration was invalid, and we provided our BGP information and previous AS numbers. The technician restored the BGP configuration, and within minutes our service was back online.
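    For readers unfamiliar with the setup, the failover described above only works while each of our routers holds an established BGP session with the Level 3 edge. As a purely hypothetical illustration (router vendor, hostnames and credentials below are assumptions, not our actual equipment), a quick session-state check against IOS-style routers could look like this:

    ```python
    # Hypothetical sketch: confirm the BGP sessions on both routers are in the
    # "Established" state. Device type, hostnames, and credentials are
    # placeholders and do not describe ASPnix's actual routers.
    from netmiko import ConnectHandler

    ROUTERS = ["rtr1.example.net", "rtr2.example.net"]

    def bgp_summary(host: str) -> str:
        conn = ConnectHandler(
            device_type="cisco_ios",
            host=host,
            username="netops",
            password="<password>",
        )
        # "show ip bgp summary" lists each neighbor and its session state;
        # a numeric prefix count in the State/PfxRcd column means Established.
        output = conn.send_command("show ip bgp summary")
        conn.disconnect()
        return output

    if __name__ == "__main__":
        for router in ROUTERS:
            print(f"--- {router} ---")
            print(bgp_summary(router))
    ```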


    We will be issuing credits to all customers in the next few days. Credits will be automatically applied to each account based on any active service(s). This excludes domains and service addons.

    If you have any questions or concerns, please open a ticket with the technical support department for your service.

    We’d like to thank everyone for their patience and understanding during this time and for choosing ASPnix!



  • TS01-UT TeamSpeak Outage (Resolved)

    No comments · May 31st, 2016

    Our TS01-UT TeamSpeak server is currently experiencing an unexpected outage. We are working with our upstream provider to find the cause and resolve it. We apologize for any inconvenience caused by this.

    Thank you for your patience and for choosing ASPnix!

  • MSSQL01 Outage (Resolved)

    4 comments · April 19th, 2016

    The MSSQL01 SQL Server is currently offline due to a failure in the RAID subsystem. During a normal rebuild of the RAID array after a failed drive had been replaced, additional drives failed within 5-10 minutes of each other, causing the loss of the array and all data stored on it.

    The server is currently being restored from the last bare metal backup image (12:06AM MDT). We expect this restore to take a few hours to complete. Once done, the server will be restored to full operational status.

    Thank you for your patience during this time.

  • TS01-GA Network Outage

    3 comments · April 18th, 2016

    Currently the data-center that hosts our TeamSpeak services in Atlanta, Georgia is experiencing a network interruption. Network engineers are currently working to resolve the issue. At this time we do not have an ETA.

    We’d like to thank everyone for their continued patience as this issue is being addressed.

    Thank you for choosing ASPnix!

  • Denver, Colorado Outage / Service Interruption

    No comments · April 17th, 2016

    This evening at 4:08PM MDT an electrical issue with a power delivery unit (PDU) in one of our cabinets caused a brief interruption of all servers and services in our Colorado data-center. Technicians were dispatched; however, upon arrival they were unable to identify any issues, as services had returned to normal.

    Another electrical issue at approximately 5:25PM MDT caused a major electrical fault in the cabinet that handles all network communications for our network. The fault triggered the data-center’s power delivery systems to cut power to the cabinet, resulting in a full outage of our Colorado location. The responding technician found that the power delivery unit had fully failed. Once the unit was removed, power was restored to the cabinet.

    The total outage time was about 58 minutes between the 2 interruptions.

    After contacting Level 3’s electricians and discussing options to better handle these unfortunate events, we will be purchasing additional electrical equipment to provide better fault tolerance against these issues in the future.

    We apologize for any inconvenience during this time. If there are any questions or concerns, please open a support ticket with the technical support department for your service.

