Cisco SD-WAN High Availability

The fundamental goal of every network is to provide continuous network service by rerouting around failures and potential causes of downtime. Cisco SD-WAN achieves high availability through a combination of four principles:

Controller redundancy
vEdge redundancy
Robust network design
Disaster recovery

Controller Redundancy

All SD-WAN controllers run as virtual machines or containers, either on-prem or off-prem. Regardless of the deployment method, the main high availability principle is having multiple controllers of each type, preferably deployed at dispersed geographic locations. This ensures that the centralized control plane remains resilient if one of the controllers fails.

Because each SD-WAN controller serves a different function in the solution, each type utilizes a different scaling and connecting technique, as illustrated in the figure below.

Cisco SD-WAN High Availability Principles

vBond Redundancy

The Cisco vBond orchestrator serves two essential functions in the SD-WAN domain:

It validates and authenticates all devices that attempt to join the SD-WAN domain.
It orchestrates the establishment of control connections between the controllers and the vEdge routers.

The Cisco vBond orchestrator runs as a virtual machine either on-prem or in the cloud. However, it is the only controller that can also run on a regular vEdge router, configured to operate as a vBond orchestrator.

A highly available Cisco SD-WAN network has multiple vBond orchestrators working in an active/active manner, preferably deployed at different on-prem geographic locations or cloud regions. Each SD-WAN device then references the vBond orchestrator by a single FQDN in its system configuration, as shown in the output below.

system
 vbond vbond.xyz.com
!

At the DNS layer, the organization associates multiple IP addresses with the vBond's DNS name. Generally, when an SD-WAN device queries the DNS server, the server sends back the IP addresses of all vBond orchestrators. The device then tries each IP address in succession, starting with one determined by a hash function, until it establishes a successful connection.
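The selection behavior described above can be sketched in a few lines of Python. This is only an illustration: the actual hash function the devices use is not specified here, so SHA-256 over a device identifier is a stand-in, and the names vbond_try_order, connect_to_vbond, and is_reachable are hypothetical.

```python
import hashlib

def vbond_try_order(device_id, resolved_ips):
    """Return the order in which a device would try the resolved vBond IPs.

    Assumption: the starting index is derived from a hash of a device
    identifier. SHA-256 is a stand-in for the real, unspecified hash.
    """
    if not resolved_ips:
        return []
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(resolved_ips)
    # Rotate the list so that the hashed entry comes first.
    return resolved_ips[start:] + resolved_ips[:start]

def connect_to_vbond(device_id, resolved_ips, is_reachable):
    """Try each IP in succession and return the first one that answers."""
    for ip in vbond_try_order(device_id, resolved_ips):
        if is_reachable(ip):  # hypothetical reachability probe
            return ip
    return None

# Example with documentation addresses; only the third vBond is up.
ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
print(connect_to_vbond("vedge-1", ips, lambda ip: ip == "192.0.2.3"))
```

Because different devices hash to different starting points, the first connection attempts are spread across the orchestrators, while every device still falls through to the remaining addresses when its first choice is down.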

In large-scale deployments, DNS servers in different geographic regions may be configured to resolve the vBond's DNS name to different IP addresses. The essential point here is that the DNS layer controls which vBond orchestrators the vEdges use at the different sites or regions.

Notice that when there are multiple orchestrators within an SD-WAN domain, each vBond maintains permanent control connections with every vManage and vSmart core. This ensures that the orchestrator provides a list of valid controllers to the vEdge routers joining the overlay fabric. However, vBonds do not establish connections between themselves and do not exchange any network state.

In scenarios where a DNS server isn't available, it is still a best practice to use an FQDN for the vBond orchestrator and resolve it with static host statements.

Cisco recommends, as a best practice, configuring an FQDN rather than an IP address even in deployments with a single vBond orchestrator. This allows for smoother scaling when the network starts to grow.

vSmart Redundancy

The Cisco vSmart controllers operate the centralized control plane of the network. They establish permanent DTLS/TLS connections with all SD-WAN devices in the domain. Over these control channels, they regularly exchange their views of the network domain to ensure that their centralized routing tables remain synchronized.

A highly available Cisco SD-WAN network has multiple vSmart controllers working in an active/active manner, preferably deployed at different on-prem geographic locations or cloud regions. Then, each vEdge router establishes control connections to two vSmart controllers by default, as shown in the figure below. When one of the controllers fails, the other one seamlessly takes over the control-plane functions of the network. The network control plane works without interruption as long as a single controller is operational in the SD-WAN domain.

Notice that vSmart controllers establish and maintain a full mesh of control connections among themselves, over which they form a full mesh of OMP sessions. All controllers synchronize their routing information bases by exchanging vRoutes, TLOC routes, policies, and encryption keys. Additionally, each controller establishes a permanent control connection to each vBond orchestrator. The vBond orchestrator uses these control channels to track which vSmarts are operational in the domain. When one of the controllers fails, the orchestrator stops providing the IP address of the unavailable vSmart to the vEdge routers joining the SD-WAN domain.

When there are multiple vSmart controllers overseeing the network domain, we generally want to have control over the number of connections each vEdge router makes to vSmart. Cisco SD-WAN provides a few configuration parameters for this purpose:

max-omp-sessions [2 by default]: a global system parameter that specifies the number of different vSmart controllers that a vEdge router can attach to. Notice that a vEdge establishes one OMP session per vSmart, regardless of the number of control connections to the controller.
max-control-connections [2 by default]: an interface-specific parameter that defines the maximum number of DTLS/TLS control connections for the interface's local TLOC.

The important point is that when there are more vSmart controllers than the max-omp-sessions parameter allows, the vEdges' OMP sessions are hashed to a subset of vSmart controllers, as shown in the figure above. Notice that there are three vSmarts, but each vEdge router has established OMP peering to only two controllers. This behavior is appropriate when the organization has multiple controllers in the same data center or cloud region. However, when the vSmart controllers are spread across two locations, the default hashing algorithm is not the best approach. That is why the Cisco SD-WAN solution provides another mechanism called controller groups.
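To make the "hashed to a subset" idea concrete, here is a minimal Python sketch. The real hashing algorithm is not documented in this article, so the sketch uses rendezvous (highest-random-weight) hashing with SHA-256 as a stand-in. The function name pick_vsmarts and the router and controller names are hypothetical.

```python
import hashlib

def _score(router_id, controller):
    """Deterministic pseudo-random score for a (router, controller) pair."""
    data = f"{router_id}|{controller}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick_vsmarts(router_id, vsmarts, max_omp_sessions=2):
    """Pick the subset of vSmarts a router peers with.

    Assumption: rendezvous hashing stands in for the real algorithm.
    Each router ranks all controllers by score and keeps the top N.
    """
    ranked = sorted(vsmarts, key=lambda c: _score(router_id, c), reverse=True)
    return ranked[:max_omp_sessions]

vsmarts = ["vsmart-1", "vsmart-2", "vsmart-3"]
for router in ["vedge-1", "vedge-2", "vedge-3", "vedge-4"]:
    # Each router ends up with two of the three controllers.
    print(router, pick_vsmarts(router, vsmarts))
```

The selection ignores where each controller is located, which is exactly the limitation that matters once the controllers are spread over more than one site.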

Controller Groups

When an organization has multiple controllers deployed at two different locations, and the max-omp-sessions parameter is left at its default value of 2, we would like to make sure that each vEdge router connects to one controller from location 1 and one from location 2, as shown in the figure below.

The controller-group-id parameter tells vEdges that a group of vSmarts is deployed in a single data center or cloud region. By default, every vSmart controller is part of controller-group-id 0, so vEdges connect to two controllers based on the hashing algorithm alone. However, when one group of controllers is deployed in one data center and another group in a different data center or cloud region, we configure them with different controller-group-ids, as shown in the figure above. When there are controllers with different IDs, the behavior of vEdges changes: a vEdge router always tries to establish its OMP peering sessions to different controller groups. For example, with max-omp-sessions left at its default of 2 and two controller groups, the router forms one OMP peering to group 1 and one to group 2. If there are multiple controllers in each group, the router uses the hashing algorithm to connect to one vSmart from group 1 and one from group 2.
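The group-aware selection can be sketched as follows. This is an illustration under assumptions, not the actual implementation: SHA-256 scoring stands in for the real hash, the round-robin walk over groups is one plausible way to spread sessions across groups, and pick_by_group plus all controller names are hypothetical.

```python
import hashlib
from collections import defaultdict

def _score(router_id, name):
    """Deterministic pseudo-random score for a (router, name) pair."""
    data = f"{router_id}|{name}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick_by_group(router_id, vsmart_groups, max_omp_sessions=2):
    """Spread a router's OMP sessions across controller groups.

    vsmart_groups maps a vSmart name to its controller-group-id.
    Assumption: groups are visited round-robin, and within a group the
    highest-scoring controller is taken first.
    """
    groups = defaultdict(list)
    for vsmart, gid in vsmart_groups.items():
        groups[gid].append(vsmart)
    for members in groups.values():
        members.sort(key=lambda c: _score(router_id, c), reverse=True)
    group_order = sorted(groups, key=lambda g: _score(router_id, f"group-{g}"),
                         reverse=True)
    chosen = []
    while len(chosen) < max_omp_sessions and any(groups[g] for g in group_order):
        for gid in group_order:
            if groups[gid] and len(chosen) < max_omp_sessions:
                chosen.append(groups[gid].pop(0))
    return chosen

two_sites = {"dc1-a": 1, "dc1-b": 1, "dc2-a": 2, "dc2-b": 2}
print(pick_by_group("vedge-1", two_sites))   # one controller per group
one_site = {"vs-1": 0, "vs-2": 0, "vs-3": 0}
print(pick_by_group("vedge-1", one_site))    # plain hashing within group 0
```

With every controller in group 0, the sketch degenerates to the plain hashed subset, which mirrors the default behavior described above.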

vSmart Controller Affinity

As the SD-WAN domain grows and spans multiple geographic regions, more vSmart controllers are typically added for resiliency. When an organization deploys controllers in more than two regions, it becomes important to ensure that vEdges connect to the vSmarts in the same or an adjacent geographic region. For example, an organization has three US data centers: one on the east coast, one somewhere in the middle, and one on the west coast. In each data center, there is a separate vSmart controller group, as shown in the figure below.

In such scenarios, we want to make sure that the vEdge routers located on the east side connect to the controllers deployed in DC-EAST and DC-MIDDLE, and the vEdges located on the west side connect to DC-WEST and DC-MIDDLE. We would not want to rely on the hashing algorithm to choose which controller groups a WAN edge router connects to. Relying solely on hashing, a router located on the east coast could end up connecting to one controller in the west data center and another one in the middle. vSmart affinity allows us to specify which controller groups a vEdge router prefers to connect to. Configuring affinity is as simple as specifying a single config line, as shown below.

system
 max-omp-sessions 2
 controller-group-list 1 2
!

When there are more controller groups than the max-omp-sessions value allows, a vEdge router connects to controllers from the listed group IDs. Within a controller group, the router connects to a single vSmart controller chosen by the hash algorithm. When that particular controller becomes unavailable, the router attempts to connect to another one in the same group.
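The affinity behavior, including failover inside a group, can be sketched like this. It is an illustration only: SHA-256 scoring stands in for the real hash, the function name pick_with_affinity and the controller names are hypothetical, and moving on to the next listed group when a whole group is down is an assumption the text above does not cover.

```python
import hashlib

def _score(router_id, name):
    """Deterministic pseudo-random score for a (router, name) pair."""
    data = f"{router_id}|{name}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick_with_affinity(router_id, vsmart_groups, controller_group_list,
                       max_omp_sessions=2, down=frozenset()):
    """Pick one live vSmart from each preferred controller group.

    vsmart_groups maps a vSmart name to its controller-group-id, and
    controller_group_list mirrors the config line of the same name.
    Assumption: a group with no live controllers is simply skipped.
    """
    chosen = []
    for gid in controller_group_list:
        if len(chosen) == max_omp_sessions:
            break
        alive = [v for v, g in vsmart_groups.items()
                 if g == gid and v not in down]
        if alive:
            chosen.append(max(alive, key=lambda v: _score(router_id, v)))
    return chosen

groups = {"east-1": 1, "east-2": 1, "mid-1": 2, "mid-2": 2,
          "west-1": 3, "west-2": 3}
first = pick_with_affinity("vedge-east", groups, [1, 2])
print(first)                                   # one east, one middle
# If the chosen east controller fails, the other east controller takes over.
print(pick_with_affinity("vedge-east", groups, [1, 2], down={first[0]}))
```

A router on the west coast would use the same logic with a different list, for example [3, 2] in this hypothetical numbering.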

Generally, we want to find the sweet spot between the number of control connections that a vEdge router maintains to vSmart controllers and the level of resiliency that we want to achieve. In the vast majority of cases, it is perfectly sufficient to leave max-omp-sessions and max-control-connections at their default values (2) and have the vEdge routers connect to the two closest vSmart controller groups.

vManage Redundancy

Organizations can deploy Cisco vManage in two primary ways: as a standalone controller or as a cluster. All vManage controllers inside a cluster operate in active/active mode. The main benefit of a vManage cluster is performance and scale. A cluster provides redundancy against a single controller failure but doesn't protect against the failure of the whole cluster. Clustering across different geographic locations is not practically feasible because it requires very high-speed (> 1 Gbps) and low-latency (< 4 ms) connectivity between the controllers. Organizations achieve resiliency against a cluster failure with a backup cluster in standby mode deployed in another data center or cloud region, as shown in the diagram below.


Comments

msizi.mthembu

Sat, 03/19/2022 - 10:16

Thanks Ivan for this easy to understand explanation. Keep up the good work.

zerodha00@gmail.com

Sat, 04/16/2022 - 10:16

Amazed by your explanation skills. Thank you for providing such great material, Ivan. Keep up the good work.

iamdheerajdubey

Fri, 05/06/2022 - 06:57

I have this query.
Suppose a cluster of three vManage.
Now, when the edge router first forms the DTLS tunnel, which vManage will it form the tunnel with?
Is it random, to whichever vManage is available at that time? Or is there any logical flow?

Ivan.Ivanov

Fri, 05/06/2022 - 11:24

Hi iamdheerajdubey,
When a vEdge router joins the overlay fabric, it receives the controller list from vBond. The list consists of the IP addresses of all 3 vManage controllers. The vEdge router chooses one vManage IP based on a hash function. Then it tries to form a control connection to the selected vManage over the local TLOC with the lowest interface port number. The first successfully established connection is kept permanently.
In practice, if you have a topology with 900 vEdges and a cluster of 3 vManage controllers, you will end up with approx. 300 routers per vManage controller.
Hope it helps,
Ivan

SaidB

Tue, 05/31/2022 - 15:38

Hi,

Thanks Ivan for the clear explanation.

Can you elaborate on the hash function used to load balance the DTLS tunnels between the vSmarts and the vManage cluster members?

By the way, you rock !

Ivan.Ivanov

Tue, 05/31/2022 - 16:57

Hi SaidB,
Thank you!
I don't exactly know how the hashing algorithm works but my understanding is this:
The router creates an ECMP hash key for each traffic flow from the combination of the source IP, destination IP, protocol, and DSCP field, by default. Using the "vEdge(config)# vpn 1 ecmp-hash-key" command you can include source and destination ports in the key as well.
The resulting hash key is a binary number (dunno the exact algorithm here). Then the router performs an XOR operation on the lower-order bits of the hash key (one bit when either of two tunnels needs to be selected, two bits for 3-4 tunnels, and so on) and selects one of the tunnels. Traffic flow with the same layer 3/layer 4 attributes always ends up on the same tunnel.
HTH, Ivan

Bassam.farghaly

Thu, 10/27/2022 - 23:54

Hi Ivan,

I think the number of vManage devices in the cluster could be odd or even, but the configuration database and statistics database should run on an odd number of nodes, as explained in the link below.

https://www.cisco.com/c/en/us/td/docs/routers/sdwan/configuration/ha-sc…

Furthermore, I like your way of explanation

ajit.sasidharan

Sun, 04/09/2023 - 21:58

Thank you Ivan. Great explanation.

vishalch

Thu, 07/18/2024 - 06:16

Info to readers: vManage redundancy can be multiples of 3 in a cluster.
@Ivan what is the maximum number of vManage cluster members and vSmarts that can be configured?

soyeliel

Tue, 10/08/2024 - 17:13

The [2 by default] is incorrect.

The default value is set by the value configured under max-omp-sessions in the system section. This max-omp-sessions is by default 2.

https://www.cisco.com/c/en/us/td/docs/routers/sdwan/command/iosxe/quali…