Here is something that happens with alarming regularity. An organization orders two internet circuits from two different providers. They install dual routers, configure failover, and put "resilient internet connectivity" on the risk register as mitigated. Everyone feels good about it. The IT director reports to the board that the site has diverse, redundant connectivity. The auditors tick the box.
Then a contractor with a backhoe severs a duct on the street outside the building. Both circuits die simultaneously. The failover worked perfectly. It failed over from a dead circuit to another dead circuit. The two "diverse" providers were both leasing last-mile fiber from Openreach, running through the same physical duct, into the same chamber, up the same riser. The diversity existed on paper, in the carrier's provisioning system, and nowhere else.
This is not an edge case. It is the default outcome when you buy connectivity without understanding infrastructure topology. And it illustrates the fundamental difference between redundancy and resilience, two concepts that get conflated constantly, to expensive effect.
Defining the terms precisely
Redundancy means having a spare. A second power supply, a second switch in a stack, a second WAN circuit. If the first one fails, the second one is there. Redundancy protects against independent component failure: a transceiver dies, a power supply trips, a fiber develops a fault. These are real failure modes and redundancy handles them well.
Resilience is fundamentally different. Resilience is the ability of a system to maintain acceptable operation through a range of failure scenarios, including correlated failures where multiple components go wrong simultaneously because they share an underlying dependency. Resilience encompasses redundancy but also demands path diversity, technology diversity, automatic detection and recovery, operational readiness, and (critically) verified independence between the things you think are protecting you.
You can have redundancy without resilience. Two circuits through the same duct are redundant. They are not resilient. You can also, in principle, have resilience without redundancy. A single circuit with robust SLA commitments, proactive monitoring, and a well-rehearsed manual fallback to cellular is arguably more resilient than dual circuits through shared infrastructure with untested failover. In practice, however, resilience usually requires redundancy as one of its building blocks. The mistake is treating the building block as the finished structure.
Physical path diversity: the foundation everyone gets wrong
Physical path diversity is where resilience designs fail most often, because it requires knowledge that sits outside the normal visibility of both the customer and the service provider.
Consider how UK fiber connectivity typically works. You order a circuit from, say, Colt or Zayo or Neos Networks. That carrier has their own fiber network, their own Points of Presence, their own core routers. But the last mile (the stretch from the nearest exchange or PoP to your building) is frequently delivered over Openreach infrastructure. Openreach owns the ducts, the chambers, the cable routes under the pavement. Your carrier leases capacity on that infrastructure. And so does the other carrier you bought your "diverse" second circuit from.
Both carriers will show you diagrams of their own networks. Different autonomous systems, different backbone routes, different Points of Presence. The diagrams look diverse. But underneath, in the physical layer that neither carrier's NMS can see, both fibers run through the same Openreach duct from the same BT exchange to the same footway box outside your building. They might even be in the same fiber bundle.
This is not the carrier lying to you; their network genuinely is diverse from their perspective. They cannot see Openreach's duct topology. Their provisioning system shows a circuit from their PoP to your premises. It does not show which specific duct segment that circuit occupies, or whether another carrier's circuit to the same building shares it. That information lives in Openreach's PIA (Physical Infrastructure Access) records, and most ISP sales teams have never looked at it.
What "diverse" actually means to a carrier
When a carrier says a circuit is "diversely routed," they typically mean one of several things, and you need to establish which:
Diverse within their core network. The circuit takes a different backbone path from your other circuit with the same carrier. This protects against a core fiber cut on their network. It does nothing for last-mile diversity.
Diverse from a different exchange. The circuit is delivered from a different local exchange or aggregation node. This is more meaningful: it means a failure at the exchange doesn't take out both circuits. But the duct routes from two different exchanges may still converge on the same street-level infrastructure before reaching your building.
Diverse at the physical duct level. This is what you actually need, and it is the hardest to achieve and verify. It means the fiber runs through entirely separate duct infrastructure: different trenches, different chambers, different building entry points. Getting this confirmed requires a duct survey from the infrastructure owner, not a reassurance from the ISP's account manager.
How to actually verify physical diversity
Start by asking your carrier a specific question: "Who owns the duct infrastructure for the last mile of this circuit?" If the answer is Openreach (which it will be for most UK premises outside of carrier-lit buildings), you know that any other circuit also delivered over Openreach ducts is potentially sharing infrastructure regardless of which ISP provides it.
Next, request a duct survey. Openreach publishes duct route information through their PIA product set, and carriers ordering Ethernet services can request route confirmation. What you want is the specific duct route. Which duct sections, which chambers, which building entry point. Get this for both circuits. Overlay them. If they share any duct segment, they are not physically diverse on that segment.
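The overlay step is mechanical once you have the segment lists. A minimal sketch, with entirely hypothetical segment identifiers standing in for real PIA survey data:

```python
# Sketch: compare two duct surveys for shared infrastructure.
# The segment identifiers below are invented for illustration; real
# identifiers come from the infrastructure owner's survey records.

circuit_a = {"duct-EX1-014", "duct-EX1-022", "chamber-JB-7731", "entry-front"}
circuit_b = {"duct-EX2-101", "duct-EX1-022", "chamber-JB-7731", "entry-front"}

shared = circuit_a & circuit_b  # set intersection = common infrastructure
if shared:
    print(f"NOT physically diverse; shared segments: {sorted(shared)}")
else:
    print("No shared duct segments found in the survey data.")
```

Any non-empty intersection means a single physical event can take out both circuits on that segment, whatever the carrier diagrams say.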
Look at building entry. Many commercial buildings have a single telecom riser or a single duct entry point from the street. Every circuit into that building, regardless of carrier, comes through that one entry. A fire in the riser, a water leak in the entry chamber, and every circuit goes down together. True diversity means separate building entry points: one from the front of the building, one from the rear, ideally from different streets entirely.
In some locations, genuine physical diversity simply is not available at any reasonable cost. There is one duct route to the building and every carrier uses it. When that is the reality, acknowledge it openly. Put it on the risk register as a known limitation. And compensate with technology diversity. A fiber circuit paired with a wireless backup that shares nothing with the duct infrastructure. Pretending diversity exists when it does not is worse than a known gap, because known gaps get compensating controls.
Automatic failover: the gap between design and reality
Having diverse paths is necessary but not sufficient. Traffic needs to move from a failed path to a working one, and the speed and reliability of that movement determines whether your users experience a brief blip or a prolonged outage.
VRRP, HSRP, and gateway failover
At the LAN edge, gateway redundancy protocols like VRRP (Virtual Router Redundancy Protocol) and Cisco's proprietary HSRP (Hot Standby Router Protocol) provide automatic failover of the default gateway. One router is active, the other is standby. When the active router fails, the standby takes over the virtual IP address and becomes the default gateway.
The theory is clean. The practice has rough edges. Default VRRP timers use a 1-second advertisement interval with a hold time of roughly 3 seconds (3x the advertisement interval plus a skew). That means failover takes approximately 3-4 seconds in the best case. HSRP is similar, with a default hello of 3 seconds, hold time of 10 seconds, so a failover can take 10+ seconds with default timers.
You can tune these down. VRRP and HSRP both support millisecond timers, and with aggressive tuning you can achieve sub-second gateway failover. But aggressive timers create their own problems. On a congested network or across a stretched VLAN, a few dropped hello packets can trigger a false failover: the standby router takes over, the original router recovers and takes it back, another hello is dropped, and you end up with a flapping gateway that is worse than a clean failure. Tuning failover timers requires balancing detection speed against stability, and the right answer depends on your specific environment.
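The arithmetic behind those figures is worth making explicit. A small sketch of the VRRPv2 detection interval, three advertisement intervals plus the skew time defined in RFC 3768:

```python
def vrrp_master_down_interval(advert_s: float, priority: int = 100) -> float:
    """Time before the backup declares the master dead (RFC 3768):
    3 x advertisement interval + skew time of (256 - priority) / 256 seconds."""
    skew = (256 - priority) / 256
    return 3 * advert_s + skew

# Default 1-second advertisements: roughly 3.6 s before the backup takes over.
print(round(vrrp_master_down_interval(1.0), 2))   # 3.61
# Aggressive 100 ms advertisements: sub-second detection, higher flap risk.
print(round(vrrp_master_down_interval(0.1), 2))   # 0.91
```

The same trade-off applies to HSRP; only the default timer values differ.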
The bigger issue with VRRP/HSRP is that they only handle gateway failure. They do not detect upstream path failure. Your active router is up, its interfaces are up, VRRP is healthy. But the WAN circuit behind it is broken at layer 3 due to a provider routing issue. VRRP does not fail over because, from its perspective, nothing is wrong with the router. Your users cannot reach the internet but the gateway is fine. You need object tracking or IP SLA probes tied to the VRRP priority to detect upstream failures and trigger a failover. This is configurable on most platforms but it is not the default, and a surprising number of VRRP deployments do not have it.
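The object-tracking idea can be modeled in a few lines. This is a conceptual sketch, not router configuration; the priorities and decrement value are illustrative:

```python
# Conceptual model of object tracking: the active router's effective
# priority drops when an upstream reachability probe fails, letting the
# standby preempt even though the local router is healthy.

BASE_PRIORITY = 110   # active router's configured priority (illustrative)
DECREMENT = 30        # applied while the tracked probe is failing
PEER_PRIORITY = 100   # standby router's priority

def effective_priority(probe_ok: bool) -> int:
    return BASE_PRIORITY if probe_ok else BASE_PRIORITY - DECREMENT

# Upstream healthy: 110 > 100, this router stays active.
print(effective_priority(True))    # 110
# Upstream probe fails: 80 < 100, the standby takes over, covering the
# "router up, WAN broken" scenario that plain VRRP misses.
print(effective_priority(False))   # 80
```

The design point is that the decrement must be large enough to drop the active router below the standby, or the tracking changes nothing.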
BGP multihoming: the right way to do WAN failover
For any site that justifies the investment, BGP multihoming is the correct approach to WAN resilience. You obtain your own IP address space (a /24 at minimum for it to be globally routable via BGP), your own AS number, and you establish BGP peering sessions with two or more upstream providers over your diverse circuits.
When a circuit fails, the BGP session over that circuit drops, routes are withdrawn, and traffic converges onto the surviving path. For outbound traffic, your router simply uses the remaining BGP-learned routes. For inbound traffic, the internet's BGP routing tables update to reflect that your prefix is now only reachable via the surviving upstream. This is elegant and automatic, but with default timers it is painfully slow.
Default BGP uses a 60-second keepalive interval and a 180-second hold time. That means it can take up to 180 seconds for a BGP session to be declared dead after a failure. Three minutes of outage before failover even begins. For many applications, that is unacceptable.
BFD (Bidirectional Forwarding Detection) solves the detection speed problem. BFD runs alongside BGP and provides sub-second failure detection, typically configured with a 300ms detection time (100ms transmit interval, multiplier of 3). When BFD detects a path failure, it tears down the associated BGP session immediately, triggering route withdrawal and convergence. With BFD, local failover (your router shifting outbound traffic to the surviving path) happens in under a second.
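Comparing the two detection mechanisms numerically makes the gap obvious. A minimal sketch using the default timers quoted above:

```python
def bgp_hold_detection(hold_time_s: int = 180) -> float:
    """Worst-case detection with keepalives alone: the full hold time
    must expire before the session is declared dead."""
    return float(hold_time_s)

def bfd_detection(tx_interval_ms: int = 100, multiplier: int = 3) -> float:
    """BFD declares the path down after `multiplier` consecutive
    missed packets at the transmit interval."""
    return tx_interval_ms * multiplier / 1000.0

print(bgp_hold_detection())  # 180.0 seconds with default BGP timers
print(bfd_detection())       # 0.3 seconds with 100 ms x 3
```

A 600-fold difference in detection time, before convergence even begins.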
But there is a catch. Local convergence is fast; global convergence is not. When your BGP session drops and routes are withdrawn, that withdrawal has to propagate through the internet's routing tables. Your upstream provider withdraws the route to their peers, who withdraw it to their peers, and so on. BGP convergence across the global routing table can take 2-5 minutes in the worst case, with intermediate states where some parts of the internet can reach you and others cannot. During this period, inbound traffic may be partially black-holed.
This is a fundamental characteristic of BGP and there is no configuration knob that fixes it. You can mitigate it by keeping both paths active (advertising your prefix to both upstreams simultaneously with AS path prepending to prefer one) so that the backup path is already in the global routing table and just needs to become the preferred path rather than appearing from scratch. This reduces global convergence time significantly because the alternative route already exists; it just needs to be promoted.
The testing gap
Here is the uncomfortable truth about failover: most organizations have never tested it under production conditions. The failover was configured during the initial deployment, someone unplugged a cable, watched traffic shift, plugged it back in, and declared success. That was two years ago. Since then, the firewall policy has been updated forty times, new applications have been deployed, the BGP configuration was tweaked for a traffic engineering change, and firmware was upgraded on both routers.
Nobody has tested whether the failover still works. Nobody has measured the actual convergence time. Nobody has verified that every application survives the transition, that stateful firewall sessions are maintained, that the VoIP system handles the gateway change, that the backup circuit can actually carry the full production load.
The organizations that take resilience seriously schedule failover tests quarterly. They pull the primary circuit during business hours, with a maintenance window communicated to the business, and they observe exactly what happens. They measure the detection time, the convergence time, the application impact. They check whether the backup circuit's bandwidth is sufficient for the actual production load (it is often undersized). They verify that traffic returns cleanly to the primary when it recovers, because failback can be just as problematic as failover, especially with BGP where route dampening can delay re-advertisement.
Every organization that runs these tests finds problems. A static route that was deleted during a maintenance window. A BFD session that was disabled during a firmware upgrade and never re-enabled. A backup circuit that has been down for three months with no alarm because the monitoring was only watching the primary. An asymmetric routing issue that causes the stateful firewall to drop return traffic during failover. These are not hypothetical. They are the standard results of a failover test on a network that has not been tested recently.
Bonded cellular as a resilience mechanism
Bonded cellular has matured significantly as a WAN backup technology, and it deserves a realistic assessment rather than the marketing version.
The basic concept: multiple SIM cards across multiple mobile network operators, with their bandwidth aggregated (bonded) into a single logical tunnel. Products from vendors like Peplink, Cradlepoint, and Mushroom Networks can bond four to eight SIMs simultaneously, providing aggregate bandwidth that is the sum of the individual connections minus bonding overhead. The bonding is typically done via a tunnel to a cloud-hosted aggregation point, which introduces latency but provides a stable public IP and clean failover behavior.
Real-world performance in a well-provisioned deployment with four SIMs across two operators: 80-150 Mbps download, 30-60 Mbps upload, with latency of 25-50ms plus whatever the tunnel endpoint adds. That is usable for most enterprise applications. It is not a replacement for a gigabit fiber circuit. It is a credible backup that keeps the business operational while the primary is restored.
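A quick sanity check on aggregate throughput, using illustrative per-SIM rates and an assumed overhead fraction; measure your own links rather than trusting these numbers:

```python
# Rough aggregate-throughput estimate for a bonded-cellular tunnel.
# Per-SIM rates and the overhead fraction are placeholder assumptions.

sim_download_mbps = [45, 38, 30, 27]  # four SIMs across two operators
bonding_overhead = 0.12               # tunnel/encapsulation loss, ~10-15%

aggregate = sum(sim_download_mbps) * (1 - bonding_overhead)
print(f"~{aggregate:.0f} Mbps usable downstream")  # ~123 Mbps
```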
The failure modes are worth understanding. Cellular performance degrades during peak usage periods, not because the radio link is saturated but because the backhaul from the cell tower is congested. Cell towers connect back to the mobile operator's core network via fiber or microwave, and that backhaul is a shared, contended resource. During a major local event, a network outage affecting many businesses (driving everyone to cellular), or simply during the afternoon peak on a congested urban cell, performance can drop substantially.
There is also the physical infrastructure overlap issue. Cell tower backhaul is often delivered over the same fiber duct infrastructure as your wired circuits. If a duct cut takes out your fiber circuits, it may also take out the backhaul to your nearest cell tower. The RF link from your building to the tower is independent, but the tower's connection to the wider network is not. This is a real limitation and it means bonded cellular is not perfectly diverse from fiber, but it is substantially diverse, which is still far better than two fiber circuits through the same duct.
For resilience purposes, bonded cellular works best as an active-passive backup that is monitored continuously and tested regularly. Keep it in standby, run synthetic probes over it to verify connectivity, and fail to it when the primary dies. Using it active-active with fiber is possible but introduces complexity around asymmetric paths, MTU differences (the bonding tunnel reduces effective MTU), and variable latency that some applications handle poorly.
Satellite failover: GEO reality and LEO promise
Satellite connectivity as a WAN backup falls into two very different categories, and the distinction matters enormously.
Geostationary (GEO) satellite (services like Hughes, Viasat, and traditional VSAT) sits at 36,000 km altitude. The physics are unforgiving. Round-trip latency is 550-650ms minimum, and that is before any network processing. TCP performance over GEO satellite requires optimization (WAN acceleration, TCP spoofing, pre-fetching) to be usable for web browsing. VoIP works with careful jitter buffer tuning but the conversational delay is noticeable and uncomfortable. Real-time applications (video conferencing, remote desktop, anything interactive) perform poorly. Bandwidth is typically 10-50 Mbps down, 2-5 Mbps up, and heavily contended.
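The latency penalty falls directly out of the TCP throughput ceiling: a single flow can move at most one window of data per round trip. A small illustration:

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_s: float) -> float:
    """Single-flow TCP ceiling: one window per round trip."""
    return window_bytes * 8 / rtt_s / 1e6

# Classic 64 KB receive window over a 600 ms GEO round trip:
print(round(max_tcp_throughput_mbps(64 * 1024, 0.6), 2))   # 0.87 Mbps
# The same window over a 40 ms terrestrial or LEO round trip:
print(round(max_tcp_throughput_mbps(64 * 1024, 0.04), 1))  # 13.1 Mbps
```

Window scaling raises the ceiling, but this is why GEO links need the acceleration tricks described above to feel usable at all.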
GEO satellite is a last-resort backup. It keeps email flowing and allows basic web access. It does not maintain normal business operations for most organizations. If your resilience design includes GEO satellite as the backup, be honest with the business about what "operational on backup" actually means in practice. It means degraded operations, not business as usual.
Low Earth Orbit (LEO) satellite (Starlink being the dominant player, with OneWeb and Amazon's Kuiper in various stages of deployment) operates at 540-570 km altitude. Latency is 25-60ms, which is comparable to a terrestrial connection and genuinely usable for all standard business applications including VoIP and video conferencing. Bandwidth on Starlink Business is typically 100-250 Mbps down, 20-40 Mbps up. That is a credible WAN connection, not a degraded fallback.
LEO satellite has meaningful advantages as a resilience mechanism. The path diversity is about as complete as you can get: the signal goes to space and back via a ground station that is typically hundreds of kilometers from your site, so a duct cut, a local exchange fire, or a regional fiber failure affects nothing on the satellite path. Its failure modes are entirely independent of terrestrial networks: weather (particularly heavy rain or snow on the dish), orbital coverage gaps (less of an issue as the constellation grows), and ground station outages.
The practical limitations are real though. Starlink does not currently offer static IP addresses on the standard business tier. You get CGNAT, which breaks site-to-site VPNs and inbound services. Starlink Business with static IP is available but at a significantly higher price point. Throughput is variable, not committed: you might get 200 Mbps at 2 AM and 80 Mbps at 2 PM. And the service operates under a fair-use policy; sustained high-bandwidth usage will get throttled.
For a WAN backup that needs to support normal business operations during a primary circuit failure, LEO satellite is currently the most compelling option where physical fiber diversity is not achievable. The combination of a primary fiber circuit and a Starlink backup provides genuine technology diversity with performance that keeps the business running normally rather than limping along.
The cost argument: when resilience is worth the premium
Resilience costs more than redundancy. Physically diverse fiber circuits cost more than two circuits through the same infrastructure, because they require more civil engineering. BGP multihoming requires provider-independent address space, an AS number, and routers capable of holding a full BGP table (or at least partial tables from multiple upstreams). Bonded cellular means ongoing mobile data contracts. Satellite means dish installation and a monthly service. The total cost of a genuinely resilient WAN design can be several times the cost of a single circuit.
The question is not whether resilience is expensive. It is. The question is what a connectivity outage costs the business, and whether the resilience premium is less than the expected annual loss from outages.
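That comparison reduces to a one-screen calculation. All figures below are placeholders; the point is the structure, not the numbers:

```python
# Back-of-envelope: is the resilience premium worth it?
# Substitute the business's own outage history and downtime costs.

outages_per_year = 1.5       # expected circuit-affecting incidents
mean_outage_hours = 4.0
cost_per_hour = 20_000       # operational + revenue impact (currency units)
resilience_premium = 36_000  # extra annual cost of the resilient design

expected_annual_loss = outages_per_year * mean_outage_hours * cost_per_hour
print(f"Expected annual outage loss: {expected_annual_loss:,.0f}")  # 120,000
print("Premium justified" if resilience_premium < expected_annual_loss
      else "Premium not justified")
```

For the distribution center below, cost_per_hour dominates everything; for the corporate office, it is small enough that the premium rarely clears the bar.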
For a corporate office where staff can work from home during an outage, the cost of downtime might be modest. Some lost productivity, some inconvenience, a few hours of disruption. A single well-provisioned circuit with a cellular backup might be entirely adequate. The investment in full BGP multihoming with physically diverse fiber paths would be hard to justify.
For a distribution center where warehouse management, pick-to-light systems, and shipping integration all depend on connectivity, an outage halts operations. Staff cannot pick orders, trucks cannot be loaded, revenue stops. The cost of an hour of downtime might exceed the entire annual cost of a resilient WAN design. Here, the premium is trivially justified.
For a broadcast facility feeding live content to air, the cost of a connectivity failure is measured in seconds. Every second of black screen is a regulatory issue, a reputational event, and possibly a contractual penalty. These sites need the full resilience stack: diverse fiber, cellular backup, satellite as a tertiary path, sub-second failover, and continuous monitoring. The cost is significant and the alternative is unacceptable.
The mistake organizations make is applying the same resilience standard to every site. A 200-person corporate office does not need the same WAN architecture as a Tier 1 data center. Over-engineering the office wastes money. Under-engineering the data center creates existential risk. Resilience design should be proportional to the business impact of failure, and that requires having an honest conversation with the business about what downtime actually costs, not in IT terms but in operational and financial terms.
A testing methodology that reveals real problems
If you are serious about resilience, you need a testing program that goes beyond "unplug the cable and see if it fails over." Here is a methodology that consistently uncovers the problems hiding in production networks.
Phase 1: Document what should happen
Before testing anything, write down exactly what you expect to happen when the primary circuit fails. Which protocol detects the failure? What is the expected detection time? Where does traffic go? What happens to existing TCP sessions? What happens to VoIP calls in progress? What applications need to reconnect? What DNS changes are needed? What is the expected failback behavior when the primary recovers?
This exercise alone often reveals gaps. "We expect VRRP to fail over the default gateway." But do you have object tracking configured so VRRP detects an upstream failure, or only a local router failure? "BGP will reconverge." But do you know your actual BGP timer configuration, and have you confirmed BFD is operational? Writing down the expected behavior forces you to understand the actual configuration, and the actual configuration frequently does not match the original design.
Phase 2: Baseline measurement
Before inducing any failure, measure the current state. Run continuous pings to an external target from a representative endpoint on the LAN. Start a long-running iperf or similar throughput test. Set up packet captures at the LAN edge and on the WAN routers. If you have VoIP, place a test call and keep it up throughout. Note the current path: which circuit is carrying traffic, what the routing table looks like, what the BGP table shows.
This baseline lets you measure the actual impact of the failover rather than relying on subjective observation. "It seemed like it failed over quickly" is not a measurement. "Ping loss was 4 packets at 1-second intervals, so failover completed in under 5 seconds" is a measurement.
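Turning a ping trace into a failover measurement is a small calculation worth writing down. A sketch, assuming a continuous ping at a fixed interval:

```python
def failover_time_from_ping_loss(lost_packets: int,
                                 interval_s: float = 1.0) -> tuple:
    """Bound the failover duration from a continuous ping trace:
    the outage lasted at least (lost - 1) intervals and at most
    (lost + 1), given the sampling granularity."""
    lower = max(lost_packets - 1, 0) * interval_s
    upper = (lost_packets + 1) * interval_s
    return lower, upper

# 4 lost pings at 1-second intervals: failover took roughly 3-5 seconds.
print(failover_time_from_ping_loss(4))  # (3.0, 5.0)
```

If you need tighter bounds, shrink the probe interval before the test, not after.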
Phase 3: Controlled failure
Simulate the failure as realistically as possible. Do not just shut down the interface on your router. That tests local detection but not the more common scenario where the circuit fails somewhere between your premises and the provider. Instead, have the carrier put the circuit into a test state, or disconnect the patch cable at the carrier demarcation point, or (if you can) shut the port on the provider's CPE. You want to test the actual failure detection mechanism, not the easiest-to-detect failure mode.
Better yet, test multiple failure types. An interface-down failure (cable unplugged) is the easiest to detect. A layer-3 failure with layer-1 still up (provider routing issue) is harder and more common. A partial failure with packet loss but not a complete outage (congested or degraded circuit) is the hardest to detect and the most insidious in production. Your failover design should handle all of these, and your testing should verify that it does.
Phase 4: Measure the impact
During the failure, measure everything. How many pings were lost? How long was the throughput test disrupted? Did the VoIP call survive? Did HTTPS sessions reconnect automatically or did users need to refresh? Did the VPN tunnel re-establish? How long did it take for inbound traffic (if BGP) to shift to the backup path? Check application logs for timeout errors or session failures.
Pay particular attention to asymmetric failures. After failover, outbound traffic may take one path while inbound traffic takes another, because BGP convergence is not instantaneous. Stateful firewalls will drop traffic that arrives on a different interface from the one the session was established on. This is one of the most common post-failover problems and it is invisible unless you are looking at firewall session tables and packet captures.
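A toy model makes the failure mode concrete. Real firewalls key their session tables more elaborately, but the principle is the same:

```python
# Toy model of why asymmetric failover breaks stateful firewalls: the
# session table records the interface a flow was established on, and
# return traffic arriving on a different interface has no matching state.

sessions = {}  # (src, dst, dport) -> interface the session was built on

def outbound(src, dst, dport, interface):
    sessions[(src, dst, dport)] = interface

def inbound_allowed(src, dst, dport, interface) -> bool:
    return sessions.get((dst, src, dport)) == interface

outbound("10.0.0.5", "203.0.113.9", 443, "wan1")  # session built on primary
print(inbound_allowed("203.0.113.9", "10.0.0.5", 443, "wan1"))  # True
# After failover, return traffic arrives on wan2 before BGP converges:
print(inbound_allowed("203.0.113.9", "10.0.0.5", 443, "wan2"))  # False
```

That final False is your users' "the internet is broken" ticket, and nothing in the routing layer will show you why.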
Phase 5: Test the failback
Restore the primary circuit and observe the failback. This is where many designs fall apart. BGP route dampening can delay re-advertisement of the restored path. VRRP preemption may or may not be configured. Traffic may split between paths during convergence, causing asymmetric routing again. Some applications that reconnected during failover may not handle a second disruption during failback gracefully.
Measure the failback the same way you measured the failover. In some environments, the failback is more disruptive than the original failure, because the failover was designed and the failback was an afterthought.
Phase 6: Document and remediate
Write up the results. Not a summary, but the actual measurements. Detection time, convergence time, packet loss, application impact, things that worked, things that did not. Assign actions for everything that did not work as expected. Then schedule the next test. Quarterly is the minimum frequency for any site where connectivity matters.
Network resilience is not a design exercise. It is an operational discipline. The architecture creates the potential for resilience. Testing, monitoring, and process determine whether that potential is realized. The organizations with the most resilient networks are not the ones with the most expensive equipment. They are the ones that have tested their failover under realistic conditions and fixed the things that broke.
Monitoring: the resilience you cannot see
A resilient network that you cannot monitor is a network where failures hide. The backup circuit developed a fault six weeks ago, but nobody noticed because traffic was flowing on the primary. Then the primary fails. The failover triggers. Traffic moves to a backup that has been broken for six weeks. Now you have zero working paths instead of one.
This happens constantly because monitoring typically focuses on the active path. SNMP polls the primary router's interfaces, the NMS shows green, everything looks healthy. The backup router's WAN interface has been down for a month but it is not being polled, or the alarm was acknowledged and forgotten, or the monitoring system only alerts on transition from up to down and the interface was down when monitoring was configured so it was never "known good" in the first place.
Resilience monitoring requires active synthetic testing on every path, including paths that are not carrying production traffic. Run pings, traceroutes, and HTTP probes across every circuit continuously. If the backup is broken, you need to know today, not during a failover at 3 AM. Set up specific alerts for the scenario where a backup path fails. This is arguably more important than alerting on a primary failure, because a primary failure is immediately visible to users while a backup failure is invisible until you need it.
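A sketch of that synthetic testing, with placeholder probe targets; real deployments pin each probe to a specific path via policy routing or per-path source addresses, which this toy version does not do:

```python
# Sketch: probe every WAN path, including the idle backup, and alert
# when any path fails. Targets below are documentation addresses.

import socket

PATHS = {
    "primary-fiber": ("198.51.100.1", 443),    # target reachable via primary
    "backup-cellular": ("198.51.100.2", 443),  # target reachable via backup
}

def path_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP connect probe; True if the target accepted the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in PATHS.items():
    if not path_up(host, port):
        print(f"ALERT: {name} probe failed; fix it before you need it")
```

Run this from cron or your monitoring platform every minute, and the six-weeks-dead backup circuit becomes a today problem instead of a 3 AM discovery.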
Out-of-band management access is essential. If your monitoring platform reaches the network through the same paths it is monitoring, you lose visibility exactly when you need it most. A dedicated LTE management connection at each site, carrying no production traffic, exists solely so the operations team can see what is happening when everything else is down. It is often the least expensive circuit at a site and the most valuable during an outage.
Building resilience that works
True network resilience is a stack of decisions, each building on the one below it. Get the foundation wrong and nothing above it matters.
Start with physical diversity. Do not trust carrier diagrams. Request duct surveys from the infrastructure owner. Confirm building entry points. Know where your cables actually run, not where a provisioning system says they run. If true physical diversity is not achievable, acknowledge the gap and compensate with technology diversity.
Add technology diversity where it matters. Combine fiber with cellular, satellite, or fixed wireless. Ensure your backup uses a fundamentally different physical medium and a fundamentally different infrastructure path. A LEO satellite link shares nothing with your fiber: not the duct, not the exchange, not the backhaul, not the weather vulnerability. That independence is the entire point.
Design failover deliberately. Choose your protocol (BGP with BFD for WAN, VRRP with object tracking for LAN gateway) and configure it for the detection speed your most sensitive application requires. Do not leave default timers in place. Understand the convergence behavior, including the global BGP convergence time for inbound traffic that you cannot control.
Eliminate hidden dependencies. Audit DNS, NTP, authentication services, monitoring, and management access. Any shared dependency between your primary and backup paths is a shared failure mode. Your backup circuit is not useful if it cannot resolve DNS because your resolvers are only reachable via the primary.
Monitor everything, especially the things not carrying traffic. Continuous synthetic testing on all paths. Out-of-band management access. Alerts on failover events, not because the failover failed but because you are now running without redundancy and the primary needs urgent restoration.
Test under realistic conditions on a regular schedule. Quarterly failover tests, during business hours, with measurements. Fix what breaks. Test again. The problems you find during a planned test are the problems you would have found during a real outage, except during a planned test you can fix them calmly instead of discovering them while the business is bleeding.
Redundancy is buying two of something and putting it on the risk register as mitigated. Resilience is engineering a system that survives the failures you have anticipated, degrades gracefully under the ones you have not, and is continuously validated through monitoring and testing. The difference is not in the equipment budget. It is in the design thinking, the verification discipline, and the willingness to test your assumptions before reality tests them for you.