How we do Resilience
Our products run on a platform as a service (PaaS) environment that is split into two main sets of infrastructure: cloud infrastructure and blockchain as a service.
We work to minimise customer impact in the event of any disruption. We leverage multiple geographically diverse data centres, have a comprehensive backup program, and gain assurance by regularly testing our disaster recovery and business continuity plans.
This page provides an overview of how we manage the lifecycle of customer data, including how we use native backup capabilities in Microsoft Azure and Amazon Web Services (AWS) to ensure availability of our services, how we regularly test our disaster recovery plans, and how we continuously improve our disaster recovery and business continuity plans.
Infrastructure and databases
Broadly speaking, the Rise-X EOP is split into two main sets of infrastructure where our products run: our core platform runs primarily on Microsoft Azure / AWS services, and we run a private blockchain service hosted by Kaleido. As we were born in the cloud, we run all our infrastructure as a service (IaaS), which gives us a natural advantage over software products that run on premises or use a hybrid model.
The Rise-X EOP is hosted in multiple Microsoft Azure and AWS regions (specifically US-East, US-West, and Sydney, with plans to expand to other regions as necessary), using the Microsoft Azure and AWS infrastructure as a service (IaaS) offerings. All the data you create is stored in our Data Storage Layer (DSL), and ultimately in MongoDB Atlas, which we host on AWS.
Backups
Rise-X realises that whatever your business does it creates data, and without your data you don’t have a business. In line with our “trust” core value, we do not take for granted that you have put your trust in us and we take extra care not to break that trust. We care deeply about protecting your data from loss and have an extensive backup program.
To mitigate irrecoverable loss of data, we automatically record backups of your data at regular intervals. Backups are copies of your data that capture its state at a given time and provide a safety measure in the event of data loss. We leverage the native snapshot capabilities of MongoDB Atlas to support full-copy snapshots and localised snapshot storage. For redundancy and to keep your business up and running, we use three live nodes, which protect against a single node failure and support hot recovery for business continuity. In case we suffer a complete failure, backups are taken hourly and kept for 3 days, daily and kept for 14 days, weekly and kept for 4 weeks, and monthly and kept for 12 months, all using AES-256 encryption. These are rolling backups of the entire system, meaning data can be restored to within the hour in the event of catastrophic system failure (remember, this is unlikely).
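The retention schedule described above can be sketched as a simple lookup. This is only an illustration of the schedule, not how MongoDB Atlas implements retention: the `RETENTION` table and the `is_retained` helper are hypothetical names, and a month is approximated as 30 days.

```python
from datetime import datetime, timedelta

# Retention windows taken from the schedule described above.
# Assumption of this sketch: a month is approximated as 30 days.
RETENTION = {
    "hourly": timedelta(days=3),
    "daily": timedelta(days=14),
    "weekly": timedelta(weeks=4),
    "monthly": timedelta(days=30 * 12),
}


def is_retained(kind: str, taken_at: datetime, now: datetime) -> bool:
    """Return True if a snapshot of this kind is still inside its retention window."""
    return now - taken_at <= RETENTION[kind]


if __name__ == "__main__":
    now = datetime(2024, 1, 5, 12, 0)
    # An hourly snapshot from four days ago has aged out of the 3-day window.
    print(is_retained("hourly", datetime(2024, 1, 1, 12, 0), now))
    # A daily snapshot from the same moment is still inside the 14-day window.
    print(is_retained("daily", datetime(2024, 1, 1, 12, 0), now))
```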
You can learn more about the native backup, restore and archive capabilities of the MongoDB service here: MongoDB Back Up, Restore and Archive.
If you have additional questions on our backup policy or you have strict data protection requirements beyond what we offer as standard, you can get hold of us at connect@rise-x.io and we are always open to feedback on where we could do better.
How we utilise multiple data centres in different geographic regions for high availability
With hurricanes, earthquakes, and tsunamis all remote, but non-zero, risks, it is imperative that data be backed up (and replicated) to a different geographical location so that data can be recovered, no matter what happens.
Rise-X does this by utilising Microsoft Azure's and AWS's highly available data centre facilities in multiple regions worldwide. Each Microsoft Azure/AWS region is a separate geographical location containing multiple, isolated locations known as availability zones (AZs). For example, AWS has US West (N. California) as a region, within which there are AZs such as us-west-1a and us-west-1b; both are in the same overall region but are physically isolated from each other. Microsoft Azure has a similar offering.
Each AZ is designed to be isolated from failures in other AZs, and to provide cost efficient, low-latency network connectivity to other AZs in the same region. This multi-zone high availability is the first line of defence and means that services running in multi-AZ deployments should be able to withstand AZ failure.
Rise-X utilises the multi-AZ deployment mode for all core services. In a multi-AZ deployment, Microsoft Azure provisions and maintains a synchronous standby replica in a different AZ of the same region to provide redundancy and failover capability. The AZ failover is automated and typically takes 60-120 seconds, so database operations can resume as quickly as possible without administrative intervention.
By exploiting these high availability strategies provided by Microsoft Azure and their global cloud infrastructure as a service (IaaS), we can confidently provide high availability for business continuity.
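Because an automated AZ failover typically takes 60-120 seconds, a client that retries for slightly longer than that window can ride through it without manual intervention. The following is a minimal sketch of that idea, not a description of the Rise-X implementation: the function name, the 180-second budget, the 5-second delay and the use of `ConnectionError` as the failure signal are all assumptions made for illustration. Real database drivers usually ship their own retry behaviour, which should be preferred where available.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_failover_retry(
    operation: Callable[[], T],
    max_wait_seconds: float = 180.0,  # assumed budget, above the 60-120 s failover window
    delay_seconds: float = 5.0,       # assumed fixed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ConnectionError until the wait budget is used up."""
    waited = 0.0
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up once another pause would exceed the budget.
            if waited + delay_seconds > max_wait_seconds:
                raise
            sleep(delay_seconds)
            waited += delay_seconds
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.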
How we determine recovery time and recovery point objectives
In an ideal world, we would never lose any vital business data. In practice though, a system with zero risk of data loss is either unattainable or prohibitively expensive. While we aim for zero data loss in any scenario and the ability to automatically survive an availability zone failure, in business continuity planning it is necessary to set “recovery time objectives” and “recovery point objectives” (RTOs and RPOs, respectively) that seek to find the right balance between cost, benefit and risk.
The RTO is the period of time after an incident within which the business process (or system) should be recovered and back up and running.
The RPO is effectively the amount of data the organisation accepts it may lose in a recovery operation. In a simple example, if you recall from reading above, we take backups hourly, which we keep for 3 days. If the system has a catastrophic failure of all 3 nodes at 11:59:59 (just before noon) on any given day and we recover from the backup taken at 11:00, you are going to lose just under 1 hour of data. That’s our maximum data loss risk and our standard RPO.
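The arithmetic behind this example can be sketched in a few lines. The helper below is hypothetical and assumes, purely for illustration, that backups land exactly on interval boundaries counted from midnight; actual snapshot times may differ.

```python
from datetime import datetime, timedelta


def data_loss_window(
    failure_at: datetime,
    interval: timedelta = timedelta(hours=1),
) -> timedelta:
    """Time between the most recent backup and the failure.

    Assumes backups are taken exactly on interval boundaries from midnight.
    """
    midnight = failure_at.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed_today = failure_at - midnight
    # The remainder is the time since the last interval boundary.
    return elapsed_today % interval


if __name__ == "__main__":
    # Failure at 11:59:59 with hourly backups: the last backup was at 11:00.
    print(data_loss_window(datetime(2024, 1, 1, 11, 59, 59)))
```

With hourly backups the window can never reach a full hour, which is why one hour is the worst-case bound used for the standard RPO.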
We continuously review and refine our RTO and RPO targets based on client user requirements and the potential impact of a disruption and we set RPO and RTO targets for each of the core services that make up the Rise-X EOP.
More specifically, we split our services up into easily understandable buckets which we call tiers. We have four tiers of service (Tier 0 to Tier 3), each tier with its own RPO and RTO targets based on the “criticality” of that service to our customers.
Tier 0 services get the highest availability standards as they are critical components that all other services rely on. Tier 1, 2 and 3 services are used to describe Rise-X business systems and internal tools and are less critical to the availability and reliability of our services to customers.
For each tier, we’ve defined mandatory targets by reviewing, amongst other things, business impact assessments and typical usage scenarios for the services we build. Our service tiers help determine availability, reliability, RTO and RPO targets as set out in the table below and may change from time to time.
| | Tier 0 | Tier 1 | Tier 2 | Tier 3 |
| --- | --- | --- | --- | --- |
| Critical infrastructure and service components | Our Tier 0 services are those that form the basis of all other services and are critical to delivery of our products. | Our Tier 1 services generally are our products, or directly related to delivery of our products. | Tier 2 services are either non-critical or internal facing. | Tier 3 services are either non-critical or internal facing. |
| Example Services | Microsoft Azure / AWS, Networking, EOP API | DevOps, GitHub, Notion | Figma, Xero | Mixpanel |
| RPO* | <1 hour | <1 hour | <8 hours | <24 hours |
| RTO** | <4 hours | <24 hours | <48 hours | <96 hours |

* RPO – Recovery Point Objective – data loss in the event of a disaster
** RTO – Recovery Time Objective – services restoration in the event of a disaster
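The targets in the table above can be encoded as data so that the result of a recovery test is checked against the right tier. This is a minimal sketch: `TierTargets`, `TARGETS` and `meets_targets` are hypothetical names chosen for illustration, and the comparison is strict because the table states each target as "less than".

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTargets:
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable time to restore service


# Values copied from the tier table above.
TARGETS = {
    0: TierTargets(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    1: TierTargets(rpo=timedelta(hours=1), rto=timedelta(hours=24)),
    2: TierTargets(rpo=timedelta(hours=8), rto=timedelta(hours=48)),
    3: TierTargets(rpo=timedelta(hours=24), rto=timedelta(hours=96)),
}


def meets_targets(tier: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True if a measured recovery stayed inside the tier's RPO and RTO."""
    targets = TARGETS[tier]
    return data_loss < targets.rpo and downtime < targets.rto


if __name__ == "__main__":
    # A 5-hour restoration misses the Tier 0 RTO but is fine for Tier 1.
    print(meets_targets(0, timedelta(minutes=30), timedelta(hours=5)))
    print(meets_targets(1, timedelta(minutes=30), timedelta(hours=5)))
```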
How we do disaster recovery testing
Rise-X conducts regular disaster recovery testing and strives for continual improvement as part of our Disaster Recovery (DR) Program. This seeks to ensure that customer data and services are reliable and resilient. We conduct both scheduled and ad hoc testing, including the following elements:
Documentation - For the critical/customer facing services (including Tier 0 and Tier 1), bi-annual reviews of backup documentation are undertaken for accuracy and completeness/currency. Any identified issues are documented and tracked through to remediation.
Process - Periodic tests of actual technical backup/recovery processes are also completed for critical/customer facing services (including Tier 0 and Tier 1), to determine whether RTO and RPO objectives are met (based on service tier classification). Any issues identified in these tests are raised and tracked until they are remediated.
Resilience and Failover – Periodic and ad hoc tests for levels of resilience across AZs are undertaken to ensure Rise-X can handle an AZ failure with minimal downtime. While we understand a complete region failure is highly unlikely, we also periodically test region failover and continue to mature our regional resiliency.
Systems - The Rise-X Architecture Board (RAB) and product design and engineering teams continuously monitor a wide variety of metrics across the services to help ensure users have excellent experiences. Automated alerts are configured to notify members of the RAB team when certain thresholds for service metrics are crossed, so that immediate action can be taken within our incident response processes.
Rise-X Architecture Board (“RAB”) – The RAB is committed to ongoing periodic DR meetings. It identifies DR gaps, focusing on remediation as necessary.
Leadership - We maintain the involvement and ongoing engagement of executive and senior management in our DR processes. With leadership involved, Rise-X has both business and technical drivers accounted for in its strategy for resilience.
Initiatives under development
Disaster Recovery Dashboard - A DR dashboard is maintained internally so that for the critical/customer facing services (including Tier 0 and Tier 1), tickets relating to oversight, maintenance and testing can be tracked centrally to ensure that reviews of documentation and backup/recovery processes are completed on time.
DR Tests and Simulations – DR tests are performed on an annual and ad hoc basis. As part of our DR tests, we perform table top exercises to help the DR teams walk through various scenarios of potential incidents. Table top exercises test different scenarios and identify gaps in our recovery processes. Scenarios for table top exercises include earthquake, fire, natural disaster, recovery drills and tests. After DR tests are performed, outputs of the tests are captured, analysed and discussed to determine the scope of the next steps for continuous improvement.
Rise-X realises that whilst our testing and processes are technically rigorous, it still takes exceptional people to bring it all together. Accordingly, Rise-X includes the following people elements in our DR Program:
Disaster Recovery Champions - DR champs are appointed within each product/service team (including underlying services) to oversee and help manage the implementation of DR within that product/service to ensure it meets service tier requirements.
Other broader business continuity measures and plans
Rise-X strives to maintain strong Business Continuity (“BC”) and DR capabilities to ensure that the effect on our customers is minimised in the event of any disruptions to our operations. The key principles guiding our BC and DR program include:
Continuous improvement – Rise-X strives to ensure improvements to resilience grow through operational efficiencies, automation, new technologies and proven practices.
Assurance through testing – Rise-X understands that through regularly scheduled testing and the application of continual improvements, we are able to achieve optimal resiliency.
Dedicated resources – Rise-X has dedicated resources to ensure our customer-facing products get the attention they need, making Business Continuity and Disaster Recovery a minimum expectation of our products and services and a core pillar of our customer value proposition. Though we are few, Rise-X has the right skills and experience on the team to respond appropriately to real-world incidents that affect global enterprises.
In summary
Rise-X combines best-in-class technologies with ongoing testing and validation to ensure our customers' data is highly available, reliable and resilient. We operate multiple geographically diverse data centres, have an extensive backup program, and gain assurance through regularly testing disaster recovery and business continuity plans. To top it all off, we have exceptional people and dedicated resources bringing our processes together and we are committed to the relentless pursuit of better every day.