A few weeks ago, I wrote an article about attacking HSRP (Hot Standby Router Protocol) and how to defend it from such attacks. Today, I’m going to discuss how to optimize HSRP timers. You may ask, why bother optimizing HSRP timers? One of the main reasons is to provide a better user experience in the case of failure.
There are two timers that you can manipulate to optimize HSRP. The first one is the hello time. The hello time is the interval in seconds or milliseconds between successive HSRP hello messages.
The second one is the hold time. The hold time is the interval in seconds or milliseconds before the active or standby router is declared to be down.
The default values for these timers are the following:
- Hello time: 3 seconds
- Hold time: 10 seconds
Optimizing HSRP timers
IOS software releases support millisecond hello and hold time values. While you can manipulate these timers to achieve sub-second failover, Cisco does not seem to recommend it.
The lowest values Cisco recommends are one second for hello time and four seconds for hold time. This recommendation has been in place for as long as I have been in the networking field (since the mid-2000s).
Cisco documentation dissuades you from lowering these timers because doing so increases CPU usage and could cause unnecessary state changes [1]. Confusingly, the same documentation page also recommends a minimum of 250 ms for hello time and 800 ms for hold time.
However, if you open a ticket with Cisco TAC, there is a good chance that they will stick to the general recommendation of no less than one second hello time and four seconds hold time. Your mileage may vary.
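For reference, the general one-second hello and four-second hold recommendation could be applied roughly as follows. This is a minimal sketch: the interface matches the one used in the later examples, but HSRP group number 1 is an assumption, so substitute your own group.

```
R1(config)#int g0/0
R1(config-if)#standby 1 timers 1 4
```

The first value is the hello time and the second is the hold time, both in seconds.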
Since Cisco does not recommend sub-second hello and hold times, how can you achieve sub-second failover? Enter BFD (Bidirectional Forwarding Detection). BFD is what Cisco recommends to achieve sub-second failover.
What is BFD?
BFD is a lightweight protocol that provides low-overhead, short-duration detection of failures in the path between adjacent forwarding engines, including the interfaces, data link(s), and, to the extent possible, the forwarding engines themselves.
Essentially, BFD (if performed in software) is less CPU-intensive than HSRP messages, which is why Cisco recommends BFD rather than manipulating the HSRP timers.
Configuring HSRP BFD peering is easy. In some cases, you only need to define the BFD parameters, and HSRP will start using it. At least, this is the case with certain IOS-XE releases running on ASR 1001-X and ASR 1001-HX. Your mileage may vary.
To configure BFD on an interface, use the commands shown below. You must configure BFD on both routers. These values are conservative, so you can lower them.
R1(config)#int g0/0
R1(config-if)#bfd interval 250 min_rx 250 multiplier 3
R2(config)#int g0/0
R2(config-if)#bfd interval 250 min_rx 250 multiplier 3
interval: This value specifies the rate, in milliseconds, at which BFD control packets are sent to the peers. The range is from 50 to 9999.
min_rx: This value specifies the rate, in milliseconds, at which BFD control packets are expected to be received from the peer. The range is from 50 to 9999.
multiplier: This value specifies the number of consecutive BFD control packets that must be missed from the peer before it is declared as unavailable. The range is from 3 to 50.
The next step is to verify that HSRP BFD peering is enabled. Some IOS versions enable this feature by default. In my EVE-NG environment, my IOSv router running version 15.7(3)M3 does not have this feature enabled by default.
To check whether HSRP BFD peering is enabled, issue the show run all | i standby bfd all-interfaces command. If the feature is disabled, you can enable it globally with standby bfd all-interfaces, or issue standby bfd under each HSRP interface.
ASR1001-X#show run | i standby bfd
ASR1001-X#
ASR1001-X#show run all | i standby bfd
standby bfd all-interfaces
IOSv1(config)#standby bfd all-interfaces
IOSv2(config)#int g0/0
IOSv2(config-if)#standby bfd
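Once BFD is configured on the interface and HSRP BFD peering is enabled, one way to confirm that the session is actually up is sketched below. With the example values, and assuming both routers use the same settings, the expected detection time is roughly 250 ms x 3 = 750 ms. The router name is taken from the earlier example, and the availability of show standby neighbors on your release is an assumption worth checking.

```
R1#show bfd neighbors details
R1#show standby neighbors
```

The first command should list the peer along with the negotiated intervals and the registered protocols, where you would expect to see HSRP. The second shows the HSRP neighbors learned on each interface.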
If you ever see a lot of HSRP state changes, make sure to check for Layer 1 issues first. I have seen input and CRC errors cause HSRP state changes.
Another possible issue, at least on ASR1K, is the default platform punt-policer value. I have seen BGP peerings with BFD enabled bounce because the default platform policer value was too low for the environment. I had to issue the command platform punt-policer 45 6000 high to raise it.
HSRP with BFD is a great feature to use to achieve sub-second failover. However, it is not always the answer. For example, on the ASR1K platform, you cannot set the BFD values lower than 750ms on port-channel interfaces.
To achieve sub-second failover in that case, you would need to manipulate the hello and hold timers instead. If you encounter this scenario, be conservative with your HSRP timer values. Follow the 250/800 timer recommendation and monitor. Optimize if needed.
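As a minimal sketch, the 250/800 recommendation could be configured as follows. The port-channel number and HSRP group number 1 are assumptions for illustration, so substitute your own.

```
R1(config)#int po1
R1(config-if)#standby 1 timers msec 250 msec 800
```

The msec keyword must precede each value that is expressed in milliseconds. Without it, the value is interpreted as seconds.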
Concordia University College of Alberta released a study on network instability in FHRP sub-second timer implementations. The authors concluded that even when CPU utilization peaked at 100%, no network instabilities were observed [3]. So you may be able to get away with lower hello and hold time values.
References
1 First Hop Redundancy Protocols Configuration Guide
2 Bidirectional Forwarding Detection
3 Study of Network Instability in VRRP and HSRP sub second timer implementation