1. Overview:
As this document has reiterated several times, monitoring is one of the most important components of a network. Only with continuous monitoring can a network admin maintain a high-performance IT infrastructure for an organization.
Like the common practices discussed earlier, there are also best practices that apply to network monitoring. While common practices define the basic components that are essential for network monitoring and apply to every network, best practices are guidelines for implementing a good network monitoring strategy. Adopting them can help the network admin streamline monitoring and identify and resolve issues faster, with a much lower MTTR (Mean Time To Resolve). Let us look at a few network monitoring best practices that many enterprises worldwide follow to help create a high-performing network.
2. Baseline network behavior:
To identify potential problems before users start complaining, the admin needs to know what is normal in the network. Baselining network behavior over a couple of weeks or even months helps the network admin understand what normal behavior in the network is. Once the normal or baseline behavior of the various elements and services in the network is understood, the admin can use that information to set threshold values for alerts.
When an element in the network is malfunctioning, some of the metrics associated with that node's performance deviate from their mean value. For example, the temperature of a core switch in the network may shoot up. The increase in temperature can be due to an increase in CPU utilization on the switch. Understanding the normal temperature and CPU utilization of the device helps the network admin detect the deviation and take corrective action before a malfunction occurs.
Knowledge of the baseline behavior of network elements helps an admin decide the thresholds at which an alert has to be triggered. This aids proactive troubleshooting and can even prevent network downtime, instead of leaving the admin to react after users in the network start complaining.
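As a minimal sketch of how a baseline can be turned into an alert threshold, the example below takes a history of samples for one metric and sets the threshold a fixed number of standard deviations above the mean. The function names, the sample values, and the choice of three standard deviations are assumptions made for illustration; a real monitoring system may use a different baselining method.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (baseline_mean, alert_threshold) for a list of metric samples.

    The threshold is the mean plus k sample standard deviations.
    k=3.0 is an illustrative default, not a recommendation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    mu = mean(samples)
    sigma = stdev(samples)
    return mu, mu + k * sigma


def is_anomalous(value, threshold):
    """True when a new reading exceeds the baseline-derived threshold."""
    return value > threshold


# Hypothetical CPU utilization history for a core switch, in percent.
cpu_history = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0]
cpu_mean, cpu_threshold = baseline_threshold(cpu_history)

print(f"baseline mean: {cpu_mean:.1f}%, alert above: {cpu_threshold:.1f}%")
print("40% is anomalous:", is_anomalous(40.0, cpu_threshold))
```

The same function could be applied to temperature readings, so that a rise in both metrics on the same device is flagged against that device's own normal values.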
3. Escalation matrix:
One of the reasons potential network issues become actual network problems is that alerts triggered by a threshold are ignored, or the right person is not alerted. In a large network, there can be multiple administrators or people who take care of different aspects of the network. There can be a security admin who looks after firewall devices and Intrusion Prevention Systems, a systems admin, or even an admin responsible only for virtualization.
When setting up monitoring and reporting, the organization should have a policy on who has to be alerted when a malfunction occurs or a potential problem is detected. Based on the policy, the person who administers the affected aspect of the network can be alerted. This in turn can reduce the time needed for analysis, which further reduces the MTTR.
In addition to alerting the right admin, an escalation matrix is also necessary. An escalation plan ensures that issues are looked at and resolved on time, specifically when the person in charge of that element is not available or takes a long time to resolve the issue. Implementing a well-thought-out escalation matrix prevents small issues from growing into large-scale, organization-wide problems.
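An escalation matrix of this kind can be sketched as a small lookup table. In the example below, each alert category maps to an ordered list of contacts, and each contact has a time budget before the alert moves to the next level. The categories, contact names, and wait times are all hypothetical and stand in for whatever the organization's policy defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str        # who is alerted at this level (hypothetical names)
    wait_minutes: int   # time this level has before the alert escalates


# Hypothetical policy: alert category -> ordered escalation levels.
ESCALATION_MATRIX = {
    "firewall": [
        EscalationLevel("security-admin", 15),
        EscalationLevel("network-manager", 30),
        EscalationLevel("it-director", 60),
    ],
    "virtualization": [
        EscalationLevel("virtualization-admin", 20),
        EscalationLevel("systems-manager", 40),
    ],
}


def contacts_to_notify(category, minutes_unacknowledged):
    """Return every contact who should have been alerted by now."""
    levels = ESCALATION_MATRIX[category]  # KeyError for an unknown category
    notified = []
    elapsed_budget = 0
    for level in levels:
        notified.append(level.contact)
        elapsed_budget += level.wait_minutes
        if minutes_unacknowledged < elapsed_budget:
            break  # this level still has time to respond
    return notified


print(contacts_to_notify("firewall", 0))
print(contacts_to_notify("firewall", 20))
print(contacts_to_notify("firewall", 50))
```

Keeping the matrix as data rather than logic means the policy can be reviewed and changed without touching the alerting code.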
4. Reports at every layer:
Network communication is commonly described using the layers of the OSI model, and every communication in a network involves the transfer of data from one system to another through various nodes, devices, and links. Each element in the network that contributes to data transfer functions at one of the layers, such as cables at the physical layer, IP addresses at the network layer, transport protocols at the transport layer, and so on.
When a data connection fails, the failure can happen at any one of the layers or even at multiple points. Using a monitoring system that supports multiple technologies to monitor all layers, as well as the different types of devices in the network, makes problem detection and troubleshooting easier. Thus, when application delivery fails, the monitoring system can indicate whether the cause is a server issue, a routing problem, a bandwidth problem, or a hardware malfunction.
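The idea of reporting at every layer can be illustrated with a short sketch that walks a set of check results from the lowest layer upward and reports the first failure, since a failure at a lower layer usually explains the failures above it. The check names and the layer mapping are assumptions for illustration, and a missing result is deliberately treated as a failure.

```python
# Hypothetical checks, ordered from the lowest layer to the highest.
LAYER_CHECKS = [
    ("physical", "link_up"),
    ("network", "gateway_reachable"),
    ("transport", "tcp_port_open"),
    ("application", "service_responding"),
]


def locate_fault(results):
    """Return (layer, check) for the lowest failing layer, or None.

    `results` maps check names to booleans. A check that is absent
    from `results` is treated as failed (an assumption of this sketch).
    """
    for layer, check in LAYER_CHECKS:
        if not results.get(check, False):
            return layer, check
    return None


healthy = {
    "link_up": True,
    "gateway_reachable": True,
    "tcp_port_open": True,
    "service_responding": True,
}
# The application is down, but only because the port is unreachable.
port_blocked = dict(healthy, tcp_port_open=False, service_responding=False)

print(locate_fault(healthy))
print(locate_fault(port_blocked))
```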
5. Implement High Availability with failover options:
Most monitoring systems are set up in the network they monitor. This allows for quicker and better data collection from monitored devices. But if a problem occurs and the network goes down, the monitoring system can go down too, rendering all the collected monitoring data useless or inaccessible for analysis.
This is why it is recommended to implement a monitoring strategy with High Availability through failover. High Availability (HA) ensures that the monitoring system does not have a single point of failure, so that even when the entire network goes down, the monitoring system remains accessible and provides data to the network engineer for issue detection and resolution. One method for HA is failover, where the monitoring data collected by an NMS is replicated and stored at a remote site. In case of failure at the primary monitoring system, the failover system can be brought up (or can come up automatically) and provide the data needed for troubleshooting. To avoid a single point of failure, it is recommended to set up the failover system at a remote DR site.
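One simple way to decide when the failover system should take over is a heartbeat timeout. The sketch below is illustrative only: the instance names, the 30-second timeout, and the idea of passing timestamps in as plain numbers are assumptions, and it does not describe how any particular NMS implements failover.

```python
def active_monitor(last_heartbeat, now, timeout_seconds=30.0,
                   order=("primary", "dr-site")):
    """Pick the first monitoring instance with a recent heartbeat.

    `last_heartbeat` maps an instance name to the time (in seconds)
    at which it last reported in. Instances are tried in `order`, so
    the DR site is used only when the primary has gone quiet.
    Returns None when no instance has a recent heartbeat.
    """
    for name in order:
        seen = last_heartbeat.get(name)
        if seen is not None and now - seen <= timeout_seconds:
            return name
    return None


# Hypothetical timestamps, in seconds.
print(active_monitor({"primary": 100.0, "dr-site": 118.0}, now=120.0))
print(active_monitor({"primary": 50.0, "dr-site": 118.0}, now=120.0))
```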
6. Configuration management:
Most network issues originate from incorrect configurations. There are several instances where even minor configuration mistakes have led to network downtime or loss of data. For example, when a new service is implemented in the network and firewall rules are being added, the person adding the new firewall rule may end up blocking a business-critical application or allowing non-business traffic.
This is where configuration management applies. When configurations are changed on network and security devices such as routers, switches, or firewalls, configuration management lets the network administrator verify that the changes do not break an already working feature. It can also be used to back up working configurations, to make bulk configuration changes that would otherwise take a significant amount of time, and to prevent unauthorized changes. Unauthorized configuration changes to devices can lead to serious security lapses, including hacking and data theft. With configuration management, the admin can keep an eye on who is making a change and what change is being made, and can even apply access control to configuration changes.
Configuration management is the proactive part of network monitoring. It helps prevent issues from occurring in the network, rather than alerting about problems after they begin.
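Reviewing what a change actually alters can be sketched with a plain text comparison between a backed-up working configuration and a candidate configuration. The example uses Python's standard `difflib` module. The firewall rule syntax and the function name are hypothetical, and a real configuration management tool would add approval and access control on top of this.

```python
import difflib


def config_changes(baseline, candidate):
    """Return the added ('+') and removed ('-') lines between two configs."""
    diff = list(difflib.unified_diff(
        baseline.splitlines(),
        candidate.splitlines(),
        fromfile="baseline",
        tofile="candidate",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    ))
    # The first two lines are the '---' / '+++' file headers; skip them.
    # Hunk markers start with '@@' and are filtered out below.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical firewall rules; the syntax is illustrative only.
working_config = "permit tcp any host 10.0.0.5 eq 443\ndeny ip any any"
proposed_config = (
    "permit tcp any host 10.0.0.5 eq 443\n"
    "permit tcp any any eq 23\n"
    "deny ip any any"
)

for change in config_changes(working_config, proposed_config):
    print(change)
```

Listing only the added and removed lines makes it easier to spot a rule, such as the extra permit line here, that allows traffic nobody intended to allow.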
7. Capacity planning and Growth:
Capacity planning applies to both the network in general and network management. When an organization grows, the IT infrastructure associated with it should grow too. An increase in business or the addition of employees affects the number of devices needed, network and WAN bandwidth, storage space, and many other factors.
Monitoring systems allow you to keep tabs on resources in the network, but whether the tools are free, open-source, or licensed, there is always a limit on the number of resources and elements that can be monitored with a specific configuration or installation. In some cases, the server on which the monitoring system is installed may need more processing power and memory. In other cases, add-on installations may be needed to increase functionality, or the license for the monitoring system may need to be expanded.
When setting up a monitoring system, account for future growth, because it affects both the server sizing for the installation and the licensing, which controls the number of resources that can be monitored. Separate purchases, upgrades, or even a move to a new monitoring system as the network grows are much more expensive than spending a bit more capex when first setting up the monitoring system.
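A rough way to account for growth when sizing a monitoring system is to project how long it will take the number of monitored elements to reach the installation or license limit. The sketch below assumes a constant compound monthly growth rate, which is a simplification, and the figures used are hypothetical.

```python
def months_until_limit(current, monthly_growth_rate, limit):
    """Estimate the months until `current` monitored elements reach `limit`.

    Assumes compound growth at a constant monthly rate (e.g. 0.05 = 5%).
    Returns 0 if the limit is already reached, and None if there is no
    growth and the limit will therefore never be reached.
    """
    if current >= limit:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = 0
    count = float(current)
    while count < limit:
        count *= 1 + monthly_growth_rate
        months += 1
    return months


# Hypothetical: 800 monitored elements, 5% monthly growth,
# and a license that covers 1000 elements.
print(months_until_limit(800, 0.05, 1000))
```

If the projected time is shorter than the planned lifetime of the installation, that is a signal to size the server and license for a larger number of elements from the start.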