At AWS re:Invent in November, Felix Candelario and Benjamin Feldon from Amazon Web Services (AWS) gave a presentation titled “Disaster Recovery & Business Continuity for Financial Institutions.”

Disaster recovery is of particular importance to securities exchanges such as NYSE, NASDAQ, and the London Stock Exchange. In the United States it is regulated by the Securities and Exchange Commission under the Regulation Systems Compliance and Integrity (Regulation SCI) rules. A disaster that takes down both production and backup server environments would be catastrophic, affecting customers worldwide. Because of the severity of such an event, and to meet the regulatory requirements set forth by Regulation SCI, a disaster recovery plan must be in place that can recover from this scenario within two hours of downtime.

Generally, a good recovery plan has two main objectives: a recovery point objective (RPO) as close to zero as possible, so that little or no data is lost, and a short recovery time objective (RTO), so that service is restored quickly.


This slide is from the presentation AWS re:Invent 2016: Disaster Recovery & Business Continuity for Financial Institutions (FIN302)

Felix and Benjamin demonstrate how creating a financial disaster recovery plan with an RPO of zero is, indeed, possible. This talk uses the fictitious AEX stock exchange where the cloud architecture includes a number of customer gateways as well as market data and matching engines. This hypothetical exchange has 100 listed symbols and 100 Broker/Dealers with each Broker/Dealer supporting 100 customers. In a typical scenario, every customer sends a Buy or Sell order for a random symbol for a random quantity every second.
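
To get a feel for the load this implies, the numbers above can be turned into a small simulation. This is only a sketch of the scenario as described, not code from the talk; the symbol names, the quantity range, and the function names are assumptions made for illustration.

```python
import random

# Figures taken from the scenario described above.
SYMBOLS = [f"SYM{i:03d}" for i in range(100)]  # 100 listed symbols (names are made up)
BROKER_DEALERS = 100
CUSTOMERS_PER_BROKER = 100


def orders_per_second():
    """Every customer sends one order per second."""
    return BROKER_DEALERS * CUSTOMERS_PER_BROKER


def one_second_of_orders(rng):
    """Generate one order per customer: random side, symbol, and quantity."""
    orders = []
    for broker in range(BROKER_DEALERS):
        for customer in range(CUSTOMERS_PER_BROKER):
            orders.append({
                "broker": broker,
                "customer": customer,
                "side": rng.choice(("BUY", "SELL")),
                "symbol": rng.choice(SYMBOLS),
                "quantity": rng.randint(1, 1000),  # quantity range is an assumption
            })
    return orders


orders = one_second_of_orders(random.Random(42))
print(len(orders))                 # 10000 orders in one second
print(len(orders) / len(SYMBOLS))  # 100.0 orders per symbol on average
```

In other words, the hypothetical exchange has to absorb roughly 10,000 orders per second, or about 100 per symbol per second, and the DR environment has to pick up that same stream.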

The goal of this particular disaster recovery plan is to replace business continuity planning (BCP) meetings that occur multiple times per year at stock exchanges with an automated and repeatable recovery plan. Architects can write one version of the code for their disaster recovery (DR) plans, optimize it over time, and end with a modern DR system that is automated, audited, and elastic.

Felix and Benjamin’s example uses the following tech stack on AWS:

  • AWS CloudFormation automates infrastructure setup
  • Troposphere generates CloudFormation templates
  • Amazon EC2 Container Service (ECS) manages containers
  • Weave Net provides a container network overlay and multicasting
  • Amazon Route 53 for service discovery (this tool tells the customer gateway which multicast address to send orders to)
  • Amazon S3 for object storage (this is the actual front-end view for the exchange, where bids, asks, and trades can be seen coming in)
  • Amazon Kinesis Firehose captures that streaming data and delivers it into an S3 bucket
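
The idea of generating CloudFormation templates from code, so that production and DR come from the same source, can be sketched with nothing but the standard library. This is not the presenters’ template: the function name, the logical resource names, and the cluster naming scheme are hypothetical, and Troposphere would express the same structure with typed Python objects rather than plain dictionaries.

```python
import json


def build_exchange_template(environment):
    """Return a minimal CloudFormation template as a plain dict.

    The logical names and the naming scheme are illustrative assumptions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"AEX exchange stack ({environment})",
        "Resources": {
            "ExchangeCluster": {
                "Type": "AWS::ECS::Cluster",
                "Properties": {"ClusterName": f"aex-{environment}"},
            },
            "MarketDataBucket": {
                "Type": "AWS::S3::Bucket",
            },
        },
    }


# The same function describes both environments; only the parameter differs.
production = build_exchange_template("production")
recovery = build_exchange_template("dr")
print(json.dumps(recovery, indent=2))
```

Because both templates come from one function, any change to the production stack is automatically reflected in the DR stack the next time it is generated.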


Weave Net’s Role in Disaster Recovery

Because Weave Net can network across clusters regardless of data center or cloud provider, its resilient container networking is an integral part of the disaster recovery strategy. In addition, Weave Net provides multicasting, a function that is not natively available on Amazon Web Services. For the AWS team, using Weave Net’s multicasting is a cleaner solution than refactoring code to work without multicast, which can be a time-consuming and complicated process.

The environment for AEX is dynamic and reliant on automated cloud technologies. Thus, it needs to be able to handle high-volume multicast feeds. Though Weave Net has automatic service discovery capability, Weave only provides the cluster-to-cluster overlay multicast network in this scenario. Amazon’s Route 53 provides service discovery, creating DNS records for each symbol within the customer gateways. Weave Net’s multicast capability delivers the symbol and pricing information to the customer and broker gateway cluster.

Matching engines resolve the DNS entry associated with a particular symbol and subscribe to the multicast address associated with that symbol. In both cases, when a customer gateway or a matching engine resolves a symbol, it receives the unique multicast address for that symbol: gateways publish orders to that address, and matching engines subscribe to it.
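
The lookup described above can be sketched as follows. This is a simplified illustration rather than the presenters’ implementation: a dictionary stands in for the Route 53 zone, and the zone name `aex.internal`, the `239.1.0.0/16` address range, and all function names are assumptions. No packets are sent; the last helper only builds the bytes a subscriber would hand to `setsockopt` with `IP_ADD_MEMBERSHIP`.

```python
import socket
import struct

MULTICAST_BASE = "239.1.0.0"  # administratively scoped range; the choice is an assumption
DOMAIN = "aex.internal"       # hypothetical DNS zone


def multicast_address(index):
    """Map a symbol index (0..65535) to a unique address in 239.1.0.0/16."""
    if not 0 <= index < 65536:
        raise ValueError("index out of range for a /16")
    base = struct.unpack("!I", socket.inet_aton(MULTICAST_BASE))[0]
    return socket.inet_ntoa(struct.pack("!I", base + index))


def build_records(symbols):
    """Stand-in for the DNS zone: one record per symbol."""
    return {
        f"{symbol.lower()}.{DOMAIN}": multicast_address(i)
        for i, symbol in enumerate(symbols)
    }


def resolve(records, symbol):
    """What a gateway or matching engine does before publishing or subscribing."""
    return records[f"{symbol.lower()}.{DOMAIN}"]


def membership_request(group):
    """Build the ip_mreq bytes: group address followed by INADDR_ANY."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


records = build_records(["AMZN", "WEAV", "AEX"])
group = resolve(records, "WEAV")
print(group)  # 239.1.0.1
```

A gateway would send orders for a symbol to the resolved group, while a matching engine would join the same group; neither needs to know where the other is running.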

Here are the key takeaways from Felix and Benjamin’s presentation:

  • The more closely you couple storing your state locally and remotely, the lower the RPO will be if and when production goes down
  • You should treat the replication mechanism as part of your production environment; this allows you to tie DR into your CI/CD pipeline
  • This plan drastically reduces Disaster Recovery costs, since you only pay for the Disaster Recovery environment once it’s up and running
  • The Disaster Recovery system can be spun up in as little as 7 minutes when done manually, though this can also be automated (spinning up virtually instantly)
  • RPO is zero since the DR system picks up exactly where production left off
  • This environment is reusable, so your development team can use CloudFormation templates to spin up a sandbox replica of your production environment
  • You’re not bound by geography: your DR environment can be spun up thousands of miles away to ensure a localized disaster doesn’t keep your business down for long
  • A DR environment that only runs when necessary means you have a reduced attack surface
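
The first takeaway, that tightly coupling local and remote state lowers the RPO, can be illustrated with a toy model. This is a conceptual sketch and not the replication mechanism used in the talk; the class and method names are invented for the example.

```python
class ReplicatedLog:
    """Toy order log with a local copy and a remote (DR) copy."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []
        self.remote = []
        self.pending = []  # written locally but not yet shipped to the remote copy

    def write(self, record):
        self.local.append(record)
        if self.synchronous:
            self.remote.append(record)   # ship to the remote copy as part of the write
        else:
            self.pending.append(record)  # ship later, in batches

    def flush(self):
        """Ship everything that is still pending to the remote copy."""
        self.remote.extend(self.pending)
        self.pending.clear()

    def records_lost_on_failure(self):
        """Records the DR site would be missing if production died right now."""
        return len(self.local) - len(self.remote)


sync_log = ReplicatedLog(synchronous=True)
async_log = ReplicatedLog(synchronous=False)
for order_id in range(5):
    sync_log.write(order_id)
    async_log.write(order_id)

print(sync_log.records_lost_on_failure())   # 0
print(async_log.records_lost_on_failure())  # 5
```

With the tightly coupled log, a failure at any moment loses nothing, which corresponds to an RPO of zero; the loosely coupled log loses whatever has not been flushed yet.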

Multicasting with Weave Net

Weave Net is unique because it enables multicasting in the cloud even though cloud providers do not currently support multicast. Weave Net works in software to bypass these multicast restrictions. Learn more about Weave Net’s multicast here.

To learn more about the inner workings of this disaster recovery plan, watch the full presentation here: