NVMe over fabrics on steroids with Excelero NVMesh

Published by Sebastian Schmitzdorff on March 22, 2017

I’ve been working with solid state disk technology since 2006.

Back then we were using RAM-based storage appliances with active disk backup and an integrated UPS to guard against power outages. At the time we were talking about capacities of up to 128 GB. Connectivity was provided via 4 Gbit Fibre Channel or DDR InfiniBand at 20 Gbit/s (16 Gbit/s net throughput).
The controller was based on an FPGA and a custom operating system.

The company was also building its own FC controllers, which enabled lower latency than off-the-shelf adapters from QLogic and Emulex. The approach was simple and delivered high performance at very low latency.

However, I was missing two key features:

  • Redundancy
  • Scalability

A redundant setup across two storage appliances was only possible by adding extra layers in the data path, from the likes of DataCore and IBM SVC, to provide redundancy and transparent failover. Scaling was done by adding more storage targets to the storage virtualization server, which created a performance bottleneck at the same time.

So this was bought at the price of adding extra complexity and extra latency.
Not very appealing to me.

Many years passed. Many new solutions came and went. Some addressed the redundancy problem, but I don’t think any of them addressed scalability, at least not while still providing low access latency, high IOPS and high throughput at the same time.

NVMe over Fabrics is all the hype these days: accessing NVMe drives over a low-latency RDMA fabric, typically standard Ethernet or InfiniBand. Some are also using PCI Express and Fibre Channel based networks.

By using 100 Gbit Ethernet we can reach a throughput of over 10 GB/s, with an access latency almost down in the nanosecond range on the network side. In real life, NVMe over Fabrics adds roughly 5 µs (microseconds) to the latency of the NVMe drives. So depending on the NVMe drive we end up somewhere in the range of 25 µs access latency. And that’s over a standard (RDMA-enabled) Ethernet network.
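A quick back-of-the-envelope check of these numbers (the ~20 µs local drive latency is my own assumption for a fast NVMe drive; the 5 µs fabric overhead and the 100 GbE line rate are from above):

```python
# Sanity check of the NVMe over Fabrics numbers quoted above.
# Assumption: a fast NVMe drive with ~20 us local 4K read latency.

line_rate_gbit = 100                  # 100 GbE
line_rate_gbyte = line_rate_gbit / 8  # raw line rate in GB/s, so ">10 GB/s" is plausible

drive_latency_us = 20   # assumed local NVMe read latency
fabric_overhead_us = 5  # added by NVMe over Fabrics, per the text
total_latency_us = drive_latency_us + fabric_overhead_us

print(f"raw line rate:      {line_rate_gbyte:.1f} GB/s")
print(f"end-to-end latency: {total_latency_us} us")
```

With those assumptions the math lands right where the article does: 12.5 GB/s of raw line rate and roughly 25 µs end to end.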

This is really great and I love it. Finally my beloved RDMA technology gets some attention.

But wait, what about scalability?
Along comes a new vendor.

What do they do differently?
They address four key requirements of today:

  • Scalability
  • Redundancy
  • Low latency, high IOPS and high throughput
  • Software-defined (meaning you can run it on any commodity server hardware)

The amazing thing is, they only launched out of stealth mode on March 8th and already have customers who run it in production.
They have combined two major technologies.

RDMA, which enables remote direct memory access without using the CPU, and NVMe, which enables the operating system to bypass the CPU while accessing the NVMe flash device.

The result is called RDDA (Remote Direct Drive Access), the drive being the NVMe flash device.

By combining the benefits of RDMA and NVMe, the flash device can be accessed over the network without utilizing the CPU of the storage target. Everything is centrally managed with a modern, state-of-the-art web UI. The management is also scalable and redundant.

They have their own client software with integrated multipathing. As you can imagine, I couldn’t wait to get my hands on this. The installation was very easy and straightforward. Once installed, I was welcomed by a modern web user interface.

From the way the web UI is designed, it is easy to tell that it was created with scalability in mind. The management server itself is independent of the data path, and you can run a cluster of multiple management servers.

• The configuration is divided into target nodes, client nodes and volume configuration.
• Volumes can span multiple targets and you can choose from RAID 0, RAID 1, RAID 10 and JBOD policies.
• You can also create various classes for targets, clients and flash disks.
• It gives the impression that one can easily manage thousands of targets and clients.
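To illustrate how these volume policies trade capacity for redundancy, here is a generic sketch of the arithmetic (my own simplification, not NVMesh’s actual allocation logic):

```python
# Generic capacity-vs-redundancy sketch for the volume policies above.
# This is a simplified model, not NVMesh's actual allocation logic.

def usable_capacity_gb(drive_sizes_gb, policy):
    n = len(drive_sizes_gb)
    smallest = min(drive_sizes_gb)
    if policy == "JBOD":
        # Concatenation: all capacity usable, no redundancy.
        return sum(drive_sizes_gb)
    if policy == "RAID0":
        # Striping: limited by the smallest drive, no redundancy.
        return smallest * n
    if policy == "RAID1":
        # Mirroring: half the capacity, survives a drive failure.
        return smallest * n // 2
    raise ValueError(f"unknown policy: {policy}")

drives = [1600, 1600]  # two 1.6 TB drives, as in my testbed below
print(usable_capacity_gb(drives, "RAID0"))  # 3200
print(usable_capacity_gb(drives, "RAID1"))  # 1600
```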

I have to admit that my testbed environment was not state of the art.

An Ivy Bridge Xeon server with two older HGST 1.6 TB NVMe drives for the target, Mellanox ConnectX-3 Pro 40 GbE NICs, an older Mellanox SX1024 switch, and a few Sandy Bridge Xeon clients.

So seeing 1.2 million IOPS from two NVMe flash drives was one thing, but having literally no CPU load on the target at the same time was absolutely stunning.
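For anyone who wants to reproduce this kind of measurement, a small fio job along these lines is the usual tool. This is only a sketch: the device path /dev/nvmesh/vol01 and the queue-depth settings are my assumptions, not the exact parameters of my test run.

```ini
; Hypothetical fio job: 4K random reads against an NVMesh client volume.
; /dev/nvmesh/vol01 is an assumed device path; adjust for your setup.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=8
runtime=60
time_based=1
group_reporting=1

[nvmesh-randread]
filename=/dev/nvmesh/vol01
```

While the job runs, watching CPU utilization on the target (e.g. with top or mpstat) is what makes the RDDA effect visible.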