How we monitor our systems, services and infrastructure

I recently had a conversation with a customer about our monitoring: how we monitor our systems, services and infrastructure, and how thoroughly monitored we are. It hit me that we’ve never really stated what we monitor, how we monitor it, or how deep our monitoring goes into our services. So I thought, why not share this information so our customers can see how our monitoring system works!

As the icon on our aspnix.com homepage says – “Monitored by PRTG”, we use Paessler’s PRTG Network Monitor. This application has gone from a very small network bandwidth / traffic grapher to a robust, complex (yet very easy to use), multi-purpose system. Not only does it monitor bandwidth, it also monitors CPU usage, RAM availability, free disk space, disk I/O and much more. Here are the different types of sensors that we use to monitor all of our servers and network devices, so you can get an idea of what we monitor…

  • Network traffic throughput and bandwidth usage
  • Switch / Router port speeds and throughput (ports are individually monitored)
  • CPU usage (per device / server)
  • RAM usage (per device / server)
  • Network card throughput / bandwidth usage
  • Hard disk drive usage / free space
  • Hard disk drive throughput / I/O
  • Hyper-V health points / performance counters
  • SQL Server performance counters (queries per second, long running queries and other counters)
  • System / device uptime
  • For Windows-based servers the “Last Windows Updates” counter is used to display the number of days since the last updates were installed
  • Our APC PDUs are monitored for their amperage / wattage usage as well as current input / output voltage
  • Services such as HTTP, FTP, DNS, SmarterStats, SMTP, IMAP, POP3, XMPP
  • For our Windows-based servers, we monitor the IIS performance counters for number of connections, POST / GET requests per second and other web server stats

Most of these sensors are checked every 30 – 60 seconds, so if something goes wrong, we are aware of it within about a minute at most. We’ve even developed our own custom sensors that tie in with other APIs such as SmarterMail or our own in-house application APIs / reports. Sensors like…

  • SmarterMail spool size, internal thread information / counts, server totals (for accounts, domains etc.)
  • Server CPU voltages, fan speeds, CPU and case temperatures as well as CPU speed
  • For TeamSpeak we monitor server totals, connection statistics such as bandwidth, latency and other details
  • On web servers we monitor totals for accounts, websites and statistics sites
  • For MySQL we monitor queries per second, total connections, active connections, threading totals and bandwidth

Like the built-in ones, most of these custom sensors are checked every 30 – 60 seconds.
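To give a rough idea of what a custom sensor can look like, here is a minimal sketch in Python. It is not our production code. It assumes a script-based PRTG custom sensor that prints its readings as JSON in the "prtg" / "result" / "channel" / "value" layout, and the function names, channel names and sample numbers are all made up for illustration. A real sensor would fetch its values from the application being monitored instead of hard-coding them.

```python
import json


def build_prtg_payload(metrics, text="OK"):
    """Format a {channel name: integer value} mapping as sensor JSON.

    Assumes the JSON layout used by script-based PRTG custom sensors:
    one "result" entry per channel, plus an optional status "text".
    """
    results = [
        {"channel": name, "value": value}
        for name, value in metrics.items()
    ]
    return json.dumps({"prtg": {"result": results, "text": text}})


def build_prtg_error(message):
    """Format an error report so the sensor shows as down with a message."""
    return json.dumps({"prtg": {"error": 1, "text": message}})


if __name__ == "__main__":
    # Hypothetical readings; a real sensor would query the monitored
    # application (for example a mail server's reporting API) here.
    sample = {"Spool size": 42, "Active threads": 7}
    try:
        print(build_prtg_payload(sample))
    except (TypeError, ValueError) as exc:
        print(build_prtg_error(str(exc)))
```

The monitoring server runs the script on the sensor's interval and turns each channel into its own graph, which is how one custom sensor can report several related numbers at once.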

We also display a lot of our service sensors for public viewing on our status page at https://status.aspnix.com/. The graphs are shown per server, for anyone and everyone to see. These are the same graphs that we see in our monitoring system: no editing, no “touch-ups”, just real, raw and truthful statistics and uptime graphs.

I hope that this quick write-up about our internal monitoring solution has helped you better understand how seriously we take this. So if you see your site is not loading and something is wrong on our end, trust me, we know about it and we are on it!

If anyone has any further questions about our system, I will be happy to answer them. Just post in the comments section!

Thank you for choosing ASPnix!