Enterprise Class Syslog Management

From NMSWiki

(NOTE: This document is not complete, I need to get time to finish writing it :-))

Syslog is a valuable mechanism for proactively capturing chronic issues affecting the network. It can identify many more exceptions and network degradation warnings than other forms of telemetry such as SNMP traps, and therefore must be utilized by support organizations. There have been several cases where syslog messages identified a critical network issue for which no SNMP trap existed. For example, one organization was able to identify a critical soft-error issue with its switch fabric module with the help of the %SYS-3-FAB_SYNCERR syslog message. This spared the customer costly downtime, since the problem was rectified before end customers started feeling the symptoms. Without such instrumentation, the organization would have learned of the problem only when users called in to complain.

However, owing to its verbose nature, syslog must be harnessed and implemented carefully, and due diligence must be invested in defining adequate thresholds to generate actionable alerts. The process also requires that the problem management team be able to identify critical syslog messages easily and, with equal ease, create problem tickets in the internal ticketing system. The syslog messages must also be prioritized according to the nature of the site emitting them; i.e., messages from critical sites must take precedence over those from non-critical deployments. These requirements are very customer-specific because each network deployment is unique. Harnessing the full potential of syslog requires a streamlined process and an underlying tool structure that provide for the collection and analysis of, and action on, all syslog messages received from the network.


Syslog Architecture

A syslog architecture should address these five key elements:

1. Event Analysis

2. Event Reporting

3. Event Remediation

4. Event Viewer

5. Event Logging Architecture


The following sections provide an overview of each of these elements. The solution itself is best defined within the Event Analysis section, which outlines the steps necessary to ensure review and classification of messages.

The Event Reporting section suggests approaches to managing the initial large volume of messages that will be encountered prior to classification. Once completed, the ongoing maintenance of messages becomes much more manageable.

The Event Remediation section will address the response to critical messages while the Event Viewer section outlines the tool structure necessary to collect, store and classify incoming messages. There are tools readily available today that can be customized to your environment and can be augmented to provide ticket generation.

Finally, the Event Logging Architecture section brings together each of the elements addressed.

This approach will provide you with the platform to effectively categorize syslog messages, establish the baseline for normal operation and ensure that messages of consequence are recognized and acted upon. Once in place, this system will categorize messages and ensure those of importance are given the attention required.


Event Analysis

There are approximately 35,000 possible syslog messages supported by the various versions of Cisco's IOS. It is imperative that only the critical, relevant messages are singled out of this large set so that analysis stays focused.

When an incoming event is received, two immediate questions should be asked:

1. Have we seen the event before?

2. Is it a non-critical event?

A database should be set up to store all incoming events along with a field indicating whether or not this is a non-critical event. If the event has been seen before and is marked as non-critical, no further action needs to be taken. If the event has never been seen before, it should be forwarded to operations for further review and, at that time, either acted upon or marked in the database as non-critical. Likewise, if the event is a critical event, an automated process should be used to open a trouble ticket for it (along with other automations).
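The decision logic described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `known_events` table, its column names, and the `triage` and `classify` helpers are all hypothetical, and SQLite stands in for whatever database is actually deployed.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the (hypothetical) known_events table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS known_events ("
        " tag TEXT PRIMARY KEY,"      # message tag, e.g. SYS-3-FAB_SYNCERR
        " non_critical INTEGER)"      # 1 = non-critical, 0 = critical, NULL = unreviewed
    )
    return db

def triage(db, tag):
    """Return 'review', 'ignore', or 'ticket' for an incoming event."""
    row = db.execute(
        "SELECT non_critical FROM known_events WHERE tag = ?", (tag,)
    ).fetchone()
    if row is None:
        # Never seen before: remember it and forward it to operations.
        db.execute(
            "INSERT INTO known_events (tag, non_critical) VALUES (?, NULL)", (tag,)
        )
        return "review"
    if row[0] is None:
        return "review"   # seen before, but operations has not classified it yet
    if row[0] == 1:
        return "ignore"   # known non-critical: no further action
    return "ticket"       # known critical: open a trouble ticket automatically

def classify(db, tag, non_critical):
    """Record the decision operations made after reviewing an event."""
    db.execute(
        "INSERT OR REPLACE INTO known_events (tag, non_critical) VALUES (?, ?)",
        (tag, int(non_critical)),
    )
```

A new tag returns `review` until someone calls `classify`; after that, the same tag returns either `ignore` or `ticket` without involving operations again.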

Consideration may also be given to prioritizing incoming events based on their source. A “metric” field could easily be inserted into the database which adds “weight” to an incoming syslog event to indicate that this host has a higher priority than other hosts.
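One possible way to apply such a weight is shown below. The host names, the weight values, and the scoring formula are assumptions made for illustration only; the only fixed point is that syslog severities run from 0 (most severe) to 7 (debug).

```python
# Hypothetical per-host weights; in practice these would be read from the
# "metric" field in the database.
HOST_WEIGHT = {"core-sw1": 10, "branch-rtr7": 1}
DEFAULT_WEIGHT = 1

def priority(host, severity):
    """Score an event; a higher score means more urgent.

    Severity is inverted (8 - severity) so that level 0 scores highest,
    then scaled by the weight of the sending host.
    """
    return HOST_WEIGHT.get(host, DEFAULT_WEIGHT) * (8 - severity)

def rank(events):
    """Sort (host, severity) pairs from most to least urgent."""
    return sorted(events, key=lambda e: priority(*e), reverse=True)
```

With these example weights, a severity 4 message from `core-sw1` outranks a severity 2 message from `branch-rtr7`, which is the kind of site-based precedence described above.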

An outline of the proposed process is provided in the following flow-chart. The process follows a simple flow that enables the organization to begin filtering event logs as they enter the architecture.

As the implementation matures, fewer non-critical events will be seen by operations personnel, greatly improving their reaction time to critical events.


Figure: Logic flow for syslog analysis

Event Reporting

Due to the size of these environments, it is nearly impossible to view every individual event coming into your management stations; this will be especially true during the initial deployment phase of this recommendation as all incoming events will be displayed by default until they are marked according to your company policies (critical vs. non-critical).

One of the simplest methods used to get a handle on event logs is to produce daily syslog reports which can be immediately used by the problem team to eliminate potential problems within the network.

This includes a mix of standard commodity reports such as:

  • Number of Syslog occurrences broken down by severity
  • Top-N syslogs in the network
  • Top-N devices generating syslogs

With regard to the long-term stability of your network, simply following up on these reports (opening tickets and repairing the reported problems) will yield more impact than any other action you can take.
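The three commodity reports can be produced from a day's worth of messages with very little code. The sketch below assumes each event arrives as a (device, raw message) pair and that messages carry the simple %FACILITY-SEVERITY-MNEMONIC tag; tags with a sub-facility, and untagged messages, are not handled. The `daily_report` function and its output layout are hypothetical.

```python
import re
from collections import Counter

# Matches tags of the form %FACILITY-SEVERITY-MNEMONIC, e.g. %SYS-3-FAB_SYNCERR.
TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+)")

def daily_report(events, n=10):
    """Summarize one day of events given as (device, raw_message) pairs."""
    by_severity, by_message, by_device = Counter(), Counter(), Counter()
    for device, raw in events:
        match = TAG.search(raw)
        if match is None:
            continue  # untagged message; skipped in this sketch
        facility, severity, mnemonic = match.groups()
        by_severity[int(severity)] += 1
        by_message[f"{facility}-{severity}-{mnemonic}"] += 1
        by_device[device] += 1
    return {
        "by_severity": dict(sorted(by_severity.items())),  # occurrences per severity
        "top_messages": by_message.most_common(n),         # Top-N syslogs
        "top_devices": by_device.most_common(n),           # Top-N devices
    }
```

Running this once a day and handing the two Top-N lists to the problem team gives them a short, ranked worklist instead of a raw event stream.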

Apart from these, there may be several other reports built and used which look for trends and problem events. These trends or events are identified via intelligent analysis methods using event correlation, device health metric systems and baseline management techniques.

Engineers then collaborate to create a set of dynamic Syslog rules that are applied to all Syslog data generated by the network. This combination provides a more insightful view of network health and network optimization opportunities from a broad base of network expert knowledge. Some of these custom reports may be:

  • Number of Syslog occurrences broken down by “problem” messages
  • Detailed reports on each of the top-20 unhealthy devices (including problem description and recommended actions)


Event Remediation

It is recommended that problem tickets be opened against all essential and unhealthy devices in the network.

The problem management team is responsible for these tickets.

Cisco.com and various other sources contain details for each of the syslog messages identified, including Cisco recommended actions. These can be used as a starting point for troubleshooting the issue and identifying the root cause.

Table: Critical Syslog Message and Recommended Action

Event Viewer

Developing or enhancing an existing syslog management tool is a pivotal activity which will contribute to the success of this solution. This will act as a one-stop shop for the problem team to have a correlated event list, sorted by both site and syslog criticality.

The high-level requirements for this tool are outlined in this section; however, software-level requirements are outside the scope of this document. Some general guidelines are provided to maximize the tool’s scalability and robustness.

  • The tool should use a web front-end which should allow easy access to all support groups within the organization.
  • It should incorporate a standard user authentication system, which can then be used in ticket generation.
  • Syslog storage, de-duplication and correlation information associated with the messages require a back-end relational database.
  • The tool should be able to de-duplicate incoming syslog messages. For example, a device running low on memory with accounting turned on can send the following message very frequently: %AAAA-3-DROPACCTLOWMEM: Accounting record dropped due to low memory
  • The tool must show this message only once and display the “received count” in another column. This will help in making the GUI more usable.
  • The tool should incorporate a top-N syslog severity report. The report is very useful in highlighting chatty devices and will directly help the review activity outlined in this document. An enhancement to such a report could be the introduction of charts, which provide a better visual breakdown of the messages.
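The de-duplication requirement in the list above amounts to keeping one row per distinct message and incrementing a counter on repeats. A minimal in-memory sketch follows; the `Row` and `Deduplicator` names are hypothetical, and a real tool would keep this state in the back-end database rather than in a dictionary.

```python
from dataclasses import dataclass

@dataclass
class Row:
    device: str
    message: str
    count: int = 1          # the "received count" column
    first_seen: float = 0.0
    last_seen: float = 0.0

class Deduplicator:
    """Collapse repeated (device, message) pairs into a single row."""

    def __init__(self):
        self.rows = {}  # (device, message) -> Row

    def add(self, device, message, timestamp):
        key = (device, message)
        row = self.rows.get(key)
        if row is None:
            # First occurrence: create the row the GUI will display.
            row = Row(device, message, 1, timestamp, timestamp)
            self.rows[key] = row
        else:
            # Repeat: bump the counter instead of adding another row.
            row.count += 1
            row.last_seen = timestamp
        return row
```

Keying on the full message text is the simplest choice, but messages that embed variable fields (interface names, counters) will not collapse; keying on the message tag alone is one alternative.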


Figure: Sample Pie-chart Display of Syslog Event Breakdown


  • The tool should be able to differentiate between messages received from critical sites and non-critical sites.

This distinction will assist the incident management team in prioritization, so that they handle the high-impact messages first. This can be provided as another sortable column in the database table which marks the message with a weighted metric number based on the device it’s being received from.

  • The tools development team would need to work with the inventory/asset management team, which holds all the inventory data, including which sites and devices your organization deems critical.
  • The SEM (Syslog Event Management) GUI should provide an option of selecting syslog messages and clicking on a button to generate a ticket directly from the main screen.
  • The ticket must be placed in the problem management team’s queue and should contain all relevant details of the message, including the received count at that instant. The assigned engineer may need to update the severity.

Event Logging Architecture

Below is a list of leading practices for large scale syslog management. Like many leading practices, these are not meant to be adopted blindly, but rather evaluated for how each fits into your environment.

Collection Stations

Design your syslog architecture in a distributed, hierarchical fashion.

  • Syslog collectors should lie as close to their networks as possible.
  • Some filtering should be done at the collection level to weed out unnecessary log data.
  • These collectors should store and forward all filtered messages to a centralized server/database for further filtering and processing.
  • All log data should be stored in a database such as MySQL (set up using a PROGRAM pipe so as not to store messages in files).
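As an illustration of the last two points, the sketch below shows the kind of script a syslog daemon could feed through a program pipe: it reads one message per line, applies a simple collector-level filter, and writes the rest straight to a database. The tab-separated line format, the `logs` table, and the `store_lines` function are assumptions for this example, and SQLite stands in for MySQL to keep it self-contained.

```python
import sqlite3

def store_lines(stream, db, max_severity=6):
    """Insert 'host<TAB>severity<TAB>message' lines into the logs table.

    Lines above max_severity (debug, level 7, by default) are dropped
    before they reach the database. Returns the number of rows stored.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS logs (host TEXT, severity INTEGER, message TEXT)"
    )
    stored = 0
    for line in stream:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # malformed line; skipped in this sketch
        host, severity, message = parts[0], int(parts[1]), parts[2]
        if severity > max_severity:
            continue  # collector-level filtering
        db.execute("INSERT INTO logs VALUES (?, ?, ?)", (host, severity, message))
        stored += 1
    db.commit()
    return stored
```

When run behind a program pipe, the script would call `store_lines(sys.stdin, connection)` so that messages flow from the daemon into the database without touching a log file.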

Device logging levels

  • Devices should be set to log all messages 0-6 for normal operation (and possibly 0-7 for debugging - although, if you are debugging, you're probably doing so on the console of the device and may not need to send level 7 to the collectors).

Network Time

  • It is imperative that you enable NTP throughout the architecture to ensure proper timestamps.

Syslog Event Manager

  • Deploy an automated tool to establish a baseline of your logs.
  • Assign people (or groups) to monitor daily Top X errors and remediate common problems.

Log Rotation and Retention

  • Establish log retention and rotation policy.
  • Rotate logs every day, keeping at least the last 30 days for forensic work (this varies greatly by network size, but it's a good starting point).
  • Include logs and log archives in a standard backup process.
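A retention policy of this kind reduces to a date comparison. The helper below is a small sketch with a hypothetical name; it only decides which daily archives fall outside the window and leaves deletion (or moving to backup) to the caller.

```python
from datetime import date, timedelta

def expired(archive_dates, today, keep_days=30):
    """Return the archive dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in archive_dates if d < cutoff)
```

Making `keep_days` a parameter reflects the point above that 30 days is a starting value rather than a rule.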