Continuity & Resilience • Article
After this year's incidents, the question is no longer "what could happen?" – it is "how long will it take to restore?"
⏱️ Estimated reading time: 9 minutes
Hyperscaler dependencies, systemic failures and third-party concentration are changing expectations: organisations are now asked to prove continuity and recovery capability, with evidence and testing.
Why talk about this now?
- Hyperscaler dependencies: the Google Cloud outage of 12 June 2025 affected multiple popular applications and exposed the cross-cutting impact of a single point of failure in identity/shared services (status.cloud.google.com).
- Productivity impacts caused by SPOFs: in July 2025, Outlook/Microsoft 365 users were left without email and calendar access for hours until a configuration fix was applied. A reminder that organisations depend on critical SaaS services in order to work, including seemingly simple applications that a typical impact assessment often classifies as "non-critical" (AP News).
- Failures in infrastructure and support services outside the organisation's control, rather than in its own IT: the Iberian blackout was classified by ERSE as ICS 3 – Blackout (the most severe ENTSO-E level) and showed that energy and telecoms are direct operational dependencies. Many organisations felt the real impact of a disruptive event that cascaded through their entire ecosystem (erse.pt).
- European supervisors are warning about concentration risk and dependency: when many Essential and Critical entities rely on the same few visible, "reputable" IT providers operating in the EU (most of them non-European), risk and dependency increase and those specific providers become critical in themselves. Aware of this, European regulators and supervisors have been stressing the need to strengthen the risk management, cybersecurity and control measures these parties implement, in order to reduce concentration and contagion risks across the whole ecosystem of dependent services (ESMA). Relevant examples include directives transposed into national law, such as NIS 2 and CER; sector-specific "lex specialis" regulations such as DORA, applicable to the financial sector and its ICT service providers since 17 January 2025; and sector extensions such as aviation, where the NIS 2 Directive (lex generalis) applies alongside the EASA Part-IS rules (lex specialis), directly applicable or via implementing regulations.
"Continuity-by-design" is not just a compliance dossier!
1) Multi-region first, then multi-cloud
Start with redundancy within the same cloud: distinct zones/regions, tested failover and immutable backups (Write Once Read Many, WORM) to support recovery. Move to multi-cloud where the impact justifies it (identity, billing, front-ends).
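As an illustration, a minimal sketch of enforcing WORM retention on a backup bucket, assuming AWS S3 with boto3; the bucket name, region and retention period are illustrative placeholders to adapt to your own recovery policy:

```python
"""Minimal sketch: WORM retention on a backup bucket via S3 Object Lock.

Assumes the bucket (name is illustrative) was created with Object Lock enabled;
pair it with cross-region replication for the multi-region step.
"""
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # primary region (example)

# Default retention: every new backup object is immutable for 35 days.
# COMPLIANCE mode means the retention cannot be shortened or removed, even by admins.
s3.put_object_lock_configuration(
    Bucket="backups-primary-example",  # hypothetical bucket name
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 35}},
    },
)
```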
2) Identity with break-glass
If IAM goes down, everything goes down. Define break-glass accounts outside SSO, dual custody of keys and a rehearsed activation/revocation procedure.
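A minimal monitoring sketch for this idea: alert on any use of a break-glass account outside a declared drill. The account names and the `fetch_signin_events` helper are hypothetical stand-ins for whatever sign-in audit export your identity provider or SIEM offers.

```python
"""Minimal sketch: alert on break-glass account usage outside a rehearsal window."""
from datetime import datetime, timedelta, timezone

BREAK_GLASS_ACCOUNTS = {"bg-admin-01", "bg-admin-02"}  # stored outside SSO, dual custody
DRILL_WINDOW_OPEN = False                              # set True only during rehearsals

def fetch_signin_events(since: datetime) -> list[dict]:
    """Placeholder: pull sign-in events from your IdP / SIEM audit export."""
    raise NotImplementedError

def check_break_glass_usage() -> list[str]:
    since = datetime.now(timezone.utc) - timedelta(hours=1)
    alerts = []
    for event in fetch_signin_events(since):
        if event["account"] in BREAK_GLASS_ACCOUNTS and not DRILL_WINDOW_OPEN:
            alerts.append(f"Break-glass account used: {event['account']} at {event['timestamp']}")
    return alerts
```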
3) SaaS “egress-ready”
For each critical SaaS service, answer: How do I export the data? How long does it take? Can I operate minimum services without the platform? Keep extracts/mirrors of what is vital for the customer.
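One way to make those answers measurable is a timed egress drill. In this sketch `export_all_records` is a hypothetical placeholder for the provider's bulk-export mechanism; the point is to measure how long a full export actually takes and keep a local mirror of what is vital for the customer.

```python
"""Minimal sketch: timed export drill for a critical SaaS service."""
import json
import time
from pathlib import Path

def export_all_records() -> list[dict]:
    """Placeholder: call the SaaS bulk-export endpoint or scheduled report."""
    raise NotImplementedError

def egress_drill(mirror_dir: Path = Path("saas-mirror")) -> float:
    started = time.monotonic()
    records = export_all_records()
    mirror_dir.mkdir(exist_ok=True)
    (mirror_dir / "export.json").write_text(json.dumps(records))
    minutes = (time.monotonic() - started) / 60
    print(f"Exported {len(records)} records in {minutes:.1f} min")  # compare against the RTO
    return minutes
```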
4) Observability that does not go down with the incident
Keep logs, alerts and status pages outside the affected provider (external monitoring, runbooks and offline contacts).
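A minimal sketch of an external heartbeat probe, assuming it runs on infrastructure independent of the monitored provider; the endpoints and the alert webhook are illustrative.

```python
"""Minimal sketch: external heartbeat probe hosted outside the monitored cloud."""
import json
import urllib.request

CHECKS = {
    "customer-portal": "https://portal.example.com/healthz",  # hypothetical endpoints
    "api": "https://api.example.com/healthz",
}
ALERT_WEBHOOK = "https://alerts.other-provider.example/notify"  # hosted elsewhere

def probe(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def run_checks() -> None:
    for name, url in CHECKS.items():
        if not probe(url):
            payload = json.dumps({"service": name, "status": "DOWN"}).encode()
            request = urllib.request.Request(
                ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
            )
            urllib.request.urlopen(request, timeout=5)
```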
5) Functional degradation plans
Define minimum viable services (what you keep running when you lose 30–50% of capacity) — and train feature toggling under pressure.
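A minimal sketch of deciding degraded mode in advance rather than during the incident; the feature names and the capacity threshold are illustrative.

```python
"""Minimal sketch: a degraded-mode toggle that sheds non-essential features."""
# Features ranked by how essential they are to the minimum viable service.
FEATURES = {
    "checkout": "essential",
    "order-status": "essential",
    "recommendations": "optional",
    "bulk-reports": "optional",
}

def active_features(available_capacity: float) -> set[str]:
    """Return the features to keep enabled given remaining capacity (0.0 to 1.0)."""
    if available_capacity < 0.7:  # lost roughly 30% or more of capacity: degrade
        return {name for name, tier in FEATURES.items() if tier == "essential"}
    return set(FEATURES)

# Example: at 50% capacity only checkout and order-status stay on.
assert active_features(0.5) == {"checkout", "order-status"}
```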
6) Contracts that genuinely help
Require RTO/RPO, early notification, right to audit and joint testing; measure response time and the quality of the evidence provided.
7) Recurring and measurable exercises
ENISA recommends testing BC/DR regularly (table-tops, simulations, hot sites, digital twins), with a record of lessons learned and reviews (ENISA).
“Evidence” that reassures regulators and customers
- DORA (financial sector): evidence of ICT risk management, resilience testing and third-party governance models, such as reports, minutes and metrics (EIOPA).
- Service concentration and third-party dependencies: demonstrate that you know who can stop you (third and fourth parties), who is in fact critical, and how you will reduce that exposure (ESMA).
- Communication: ready-made playbooks (customers, media, supervisors) with target times and trained spokespersons.
60-minute checklist (when everything stops)
- Activate the runbooks and crisis bridge (out-of-band communication).
- Confirm the scope (what is down / what is working) and activate degraded mode.
- Decide within 15 minutes: automatic/manual failover, change freeze and the initial message to customers.
- Record everything (times, decisions, evidence); a minimal logging sketch follows this checklist.
- Update status hourly until stability is restored.
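To support the "record everything" step above, a minimal append-only logging sketch; the file name and fields are illustrative, and what matters is timestamped, ordered evidence that survives the post-incident review.

```python
"""Minimal sketch: append-only incident log (times, decisions, approvals)."""
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("incident-2025-xx-xx.csv")  # hypothetical incident identifier

def record(event: str, decision: str = "", approved_by: str = "") -> None:
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["utc_time", "event", "decision", "approved_by"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), event, decision, approved_by])

record("Scope confirmed: IAM down, portal degraded")
record("Failover", decision="manual failover to region B", approved_by="crisis lead")
```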
Metrics the Board understands
- RTO/RPO by service (target vs. achieved in the latest test; see the sketch after this list)
- MTTR by scenario (average and p95)
- Percentage of backups restored in the last 90 days
- EDR/SIEM coverage and % of runbooks successfully tested
- Dependency map (third/fourth party) and time to first communication to customers
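A minimal sketch of how the first two metrics can be computed from exercise logs; the services, scenarios and numbers are illustrative.

```python
"""Minimal sketch: RTO target vs. achieved and MTTR (mean / p95) from test results."""
from statistics import mean, quantiles

rto_tests = {  # service: (target_minutes, achieved_in_latest_test)
    "payments-api": (60, 48),
    "customer-portal": (120, 150),
}
mttr_by_scenario = {  # scenario: recovery times (minutes) across past incidents/tests
    "loss-of-IAM": [35, 50, 42, 90],
    "region-failover": [25, 30, 28, 33],
}

for service, (target, achieved) in rto_tests.items():
    status = "OK" if achieved <= target else "MISSED"
    print(f"{service}: RTO target {target} min, achieved {achieved} min [{status}]")

for scenario, times in mttr_by_scenario.items():
    p95 = quantiles(times, n=100, method="inclusive")[94]  # 95th percentile
    print(f"{scenario}: MTTR mean {mean(times):.0f} min, p95 {p95:.0f} min")
```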
Questions you should ask your provider this week (after reviewing your contract)
- What is the contractual RTO/RPO and what has been observed in real tests?
- How can data be exported in an emergency scenario and how long does it take?
- What alternatives exist if SSO/IAM fails?
- Can a regional/service failure scenario be tested jointly? If so, how and when?
- What evidence is delivered after the incident?
100-day plan (a new angle, without repeating what you already have)
0–30 days — Concentration risk snapshot
Map critical services ↔ regions/providers ↔ single points of failure, including energy/telecoms (erse.pt). Mark quick wins (break-glass accounts, external status page); a dependency-map sketch follows this plan.
31–60 days — Rehearse and measure
Run a "loss of IAM/cloud" table-top plus a timed restore test; prepare RTO/RPO, MTTR and communication dashboards (status.cloud.google.com).
61–100 days — Prove and improve
Run a technical test (loss of GCP/Azure/M365), a contractual review of critical SaaS services (export, notification, right to audit) and an after-action review (AP News).
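The dependency-map sketch referenced in days 0–30; service and provider names are illustrative, and the goal is simply to surface where many critical services converge on a single provider or utility.

```python
"""Minimal sketch: concentration-risk snapshot of services vs. providers."""
from collections import defaultdict

# service -> direct (third-party) and indirect (fourth-party) dependencies
dependencies = {
    "customer-portal": ["cloud-provider-A", "dns-provider-X", "energy-grid"],
    "payments": ["cloud-provider-A", "saas-billing-Y", "energy-grid"],
    "support-desk": ["saas-helpdesk-Z", "cloud-provider-A", "telecom-T"],
}

concentration = defaultdict(list)
for service, providers in dependencies.items():
    for provider in providers:
        concentration[provider].append(service)

# Providers that can stop more than one critical service are concentration risks.
for provider, services in sorted(concentration.items(), key=lambda kv: -len(kv[1])):
    if len(services) > 1:
        print(f"SPOF candidate: {provider} -> {', '.join(services)}")
```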
Useful reading (2025)
- GCP incident status (12/Jun/2025) and independent analyses. status.cloud.google.com
- Microsoft 365/Outlook outage (Jul/2025) — configuration fix and progressive restoration. AP News
- Iberian blackout classified as ICS 3 – Blackout (ERSE). erse.pt
- IT third-party concentration concerns European supervisors (JC ESAs). ESMA
- ENISA: test BC/DR regularly and document lessons learned. ENISA
Recommended Behaviour courses
- ISO 22301 Lead Implementer — BIA, RTO/RPO, runbooks and exercises
- ISO 22301 Lead Auditor — measure effectiveness and prepare audits
- NIS 2 Compliance Lead Manager — governance, reporting and third-party risk management
- DORA Compliance Lead Manager — digital operational resilience (financial sector)
Author: Behaviour
Published on: November 24, 2025
Copying or reproducing this article is not authorised.