Network monitoring solutions are great at answering “Is my device down?”, but not all solutions can answer “Why is it down?”
This question becomes even harder to answer when you don’t have physical access to the device. When you have tens, hundreds or even thousands of satellite offices or retail store locations, you need a quick and reliable method to determine the probable cause and extent of connectivity loss in order to efficiently respond to the event.
The customer’s requirement.
Our customer is a North American retailer with over 10,000 stores nationwide.
Our customer had several long-standing contracts with regional operators whose role was to respond to hardware issues at any site in their city, state or region. There was concern around the impact on the business resulting from the time taken to identify and address hardware issues.
Once a device was reported as down, our customer needed to determine whether the issue was hardware related, requiring a ticket to be submitted to the local operator, or whether the unavailability of the remote site was due to an environmental factor such as a power outage.
If a location was experiencing hardware issues, what was the extent of the issue and therefore impact on the location’s ability to function? How was the customer to prioritize the resulting support tickets?
Ideally, what the customer, and other businesses facing the same issue and similar questions, require is:
- A simple presentation of current device state across all business sites
- An “at-a-glance” indication of the extent of any down event. From this they can derive the expected impact on the location’s ability to function
- Some indication of the probable cause, be it hardware or environmental (such as a power outage)
In addition, with a business concern as large and widespread as our customer’s, the solution should also include:
- Targeted alerting for individual and clustered down events
- A NOC screen component presenting network-wide outages in a simple visual format
While this customer has a significant number of retail stores, the same issue is encountered by any business operating out of multiple sites: how do you determine the nature of an outage at a remote location, and how do you do it quickly? When a remote site is identified as being down by the network, is it due to a device being down, a drop in connectivity due to a store closing down, a very local power outage, or a more widespread environmental issue such as an outage caused by inclement weather? The process a business needs to follow to respond to this type of issue in the most efficient and effective manner depends entirely on the cause or nature of the outage.
“We had an instance recently in Michigan, where we had 3 or 4 stores locally grouped together that were experiencing a power outage due [to] a snowstorm.” – Store Technical Support
Previously, our customer had deployed Statseeker to monitor its core and WAN edges in the enterprise network. This included the infrastructure relating to the business operation – the head and regional offices and corporate facilities – but not the retail stores. Monitoring of the retail stores was outsourced. However, when it came to the issue of down devices in store locations, the information available from the third party was not considered sufficient or timely enough for many internal elements of the business. Our customer relied on a script that would ping both primary and backup connection routers to provide some on-demand data.
“If both [routers] were down, there was a 90-95% chance that store didn’t have power and then we would feed that information on to our business partners.” – Senior Manager, Store Technology Service
This script was run manually on an hourly or two-hourly basis, and it wasn’t run against the entire chain of stores. Instead, it was manually targeted at just those locations believed to be adversely affected by environmental issues such as storms, wildfires and hurricanes.
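The rule of thumb behind that script can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the customer’s actual script: the function names, the store identifiers, the addresses and the `is_reachable` callable (a stand-in for whatever ping mechanism is used) are all assumptions.

```python
from typing import Callable, Dict, Tuple


def classify_site(primary_up: bool, backup_up: bool) -> str:
    """Apply the rule of thumb: both routers down suggests a power outage."""
    if primary_up and backup_up:
        return "ok"
    if not primary_up and not backup_up:
        return "probable power outage"
    # Exactly one router is unreachable, so the site still has power.
    return "probable hardware fault"


def survey_stores(
    stores: Dict[str, Tuple[str, str]],
    is_reachable: Callable[[str], bool],
) -> Dict[str, str]:
    """Check the (primary, backup) router pair of every store."""
    return {
        store: classify_site(is_reachable(primary), is_reachable(backup))
        for store, (primary, backup) in stores.items()
    }


# Made-up stores and addresses; a set lookup stands in for a real ping.
stores = {
    "store-0001": ("10.0.1.1", "10.0.1.2"),
    "store-0002": ("10.0.2.1", "10.0.2.2"),
    "store-0003": ("10.0.3.1", "10.0.3.2"),
}
reachable = {"10.0.1.1", "10.0.1.2", "10.0.2.1"}
results = survey_stores(stores, lambda address: address in reachable)
```

Passing the reachability check in as a callable keeps the classification separate from the probing, so the same rule could be applied to a whole chain of stores rather than a hand-picked subset.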
By having Statseeker monitor its retail stores in addition to the corporate offices, our customer would be able to collect data from their store locations and use that data to build reports, trigger alerts, and populate dashboards. While configuration data for these devices is collected and updated daily, both timeseries data (load, temperature, traffic rate, etc.) and event data, including current device state, are collected every minute.
The Statseeker solution included a member of our support team writing a script that updated the Statseeker configuration data for the customer’s monitored devices with the latitude and longitude of the store in which the devices are located. Once the location data for a device was known to Statseeker, any other data associated with that device, including device state, could then be displayed on a map panel in a dashboard.
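As a rough illustration of the data preparation such a script involves, the Python sketch below joins a list of device names to a table of store coordinates. It does not reproduce the support team’s script: the CSV layout, the `store_id-device` naming convention, the sample values and the function names are all assumptions, and the step that writes the result into Statseeker is deliberately left out because the mechanism is not described here.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_store_coords(csv_text: str) -> Dict[str, Tuple[float, float]]:
    """Parse 'store_id,latitude,longitude' rows into a lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["store_id"]: (float(row["latitude"]), float(row["longitude"]))
        for row in reader
    }


def build_location_updates(
    devices: List[str],
    coords: Dict[str, Tuple[float, float]],
) -> List[dict]:
    """Pair each device with the coordinates of the store it sits in."""
    updates = []
    for name in devices:
        # Assumed naming convention: "<store_id>-<device role>".
        store_id = name.split("-", 1)[0]
        if store_id in coords:
            latitude, longitude = coords[store_id]
            updates.append(
                {"device": name, "latitude": latitude, "longitude": longitude}
            )
    return updates


# Made-up sample data for illustration only.
sample_csv = "store_id,latitude,longitude\ns0001,42.33,-83.05\ns0002,44.76,-85.62\n"
sample_devices = ["s0001-primary", "s0001-backup", "s0002-primary", "s9999-primary"]
updates = build_location_updates(sample_devices, load_store_coords(sample_csv))
```

Devices whose store is missing from the coordinate table are skipped rather than given a guessed location, so they can be followed up separately.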
“Every morning I come in, I can look at the map and if I see one or two primary routers down, I know that if I don’t have tickets open on those then I need to get those out to the field as soon as possible, because we don’t want the store to come in and open without a primary connection.” – Store Technical Support
Each store location has three pieces of core network infrastructure that are monitored by Statseeker, including the primary and backup connection routers. A single map point is used for each location, with each point changing both color and size depending on the number of monitored devices currently reported as being down:
- Yellow – 1 device down
- Orange – 2 devices down
- Red – 3 devices down
A yellow or orange site indicates that a ticket needs to be entered to get an engineer to the store to investigate and resolve the issue. Hovering over the site markers on the map displays an on-screen pop-up detailing which device is down. In addition, each site marker can be selected to present detailed information for the devices located at that site.
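The color scheme above amounts to a lookup keyed on the number of devices down. The Python sketch below shows one way to express it; it is not how Statseeker implements its map panel. The marker radii, the green marker for a healthy site, the device names and the action labels are illustrative assumptions, while the yellow, orange and red thresholds come from the list above.

```python
from typing import Dict, NamedTuple


class Marker(NamedTuple):
    color: str
    radius: int  # hypothetical marker size in pixels
    action: str


MARKERS = {
    0: Marker("green", 4, "none"),  # assumption: healthy sites shown in green
    1: Marker("yellow", 6, "open ticket"),
    2: Marker("orange", 8, "open ticket"),
    3: Marker("red", 12, "check for power outage"),
}


def marker_for(device_up: Dict[str, bool]) -> Marker:
    """Choose a marker from the number of monitored devices that are down."""
    down = sum(1 for up in device_up.values() if not up)
    # Each site has three monitored devices, so cap the lookup key at 3.
    return MARKERS[min(down, 3)]


# Hypothetical device names; the third monitored device is not named here.
site = {"primary_router": False, "backup_router": True, "third_device": True}
marker = marker_for(site)
```

Keeping the thresholds in a single table makes it straightforward to adjust colors or sizes without touching the counting logic.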
“For single device outages we find that the backup router going down doesn’t necessarily impact the store, but with Statseeker we’ll see that they are going down before we get notified by the current vendor, so we can already take action before they tell us.” – Store Technical Support
The presence of a larger red marker indicates that the entire location is currently unreachable and may be subject to a regional power outage. This information is updated in real time and can be used to trigger notifications and drive a suitable response throughout the organization. This panel configuration can be reused in any number of dashboards and, because it is presented via a web interface, it is available wherever it is needed, including as content in any modern NOC presentation framework.
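To make the idea of a regional outage concrete, the Python sketch below groups fully unreachable sites that lie close to one another, which is roughly what an operator does by eye on the map. It is an illustration only and does not describe Statseeker’s alerting logic: the 25 km radius, the site names, the coordinates and the function names are all assumptions.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_down_sites(
    sites: Dict[str, Tuple[float, float]],
    radius_km: float = 25.0,  # assumed threshold for "locally grouped"
) -> List[List[str]]:
    """Group sites linked by a chain of neighbours within radius_km."""
    remaining = set(sites)
    clusters = []
    while remaining:
        seed = min(remaining)  # deterministic starting point
        remaining.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = sorted(
                s for s in remaining
                if haversine_km(sites[current], sites[s]) <= radius_km
            )
            for s in near:
                remaining.remove(s)
                group.append(s)
                frontier.append(s)
        clusters.append(sorted(group))
    return clusters


# Made-up coordinates: three sites close together and one far away.
down_sites = {
    "store-a": (42.33, -83.05),
    "store-b": (42.40, -83.10),
    "store-c": (42.28, -83.00),
    "store-d": (44.76, -85.62),
}
clusters = cluster_down_sites(down_sites)
```

A cluster of several fully down sites points towards a shared environmental cause, whereas an isolated red marker is more likely to be a problem local to that store.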
By default, Statseeker records important event data relating to your monitored hardware (changes to ping state, ifOperStatus and ifAdminStatus). A filtered view of this data is ideal for the structured alerting options offered by Statseeker and combines well with additional topological filters, such as our upstream device configuration, to ensure that the alert content is clean and focused.
“We had a BGP outage last Wednesday and I had received the BGP outage alert from Statseeker. I was able to go into the other monitoring tool and see that not all of the tickets had been opened for that BGP outage. With Statseeker we were able to get that outage and those numbers over to our account reps. That became a requirement for us, and with Statseeker we are able to monitor BGP for our T1s and broadband.” – Senior Manager, Network Infrastructure
While Statseeker was designing and delivering this solution, the customer began to turn its attention to other elements of the network and the range of data that Statseeker is able to present as a report, a real-time dashboard, or an alert. This now gives them visibility of a number of elements on their network for which the data coming from the incumbent system was previously incomplete or entirely unavailable.
“The peace of mind comes from knowing what’s down at any given time.” – Store Technical Support