Excelero Storage launched their NVMesh product back in March 2017 at Storage Field Day 12. NVMesh is a software-defined storage solution using commodity servers and NVMe devices. Using NVMesh and the Excelero RDDA protocol, we saw some mind-blowing performance numbers, both in raw IOps and in latency, while keeping hardware and licensing costs low.
Excelero is a young company, founded in 2014 in Tel Aviv, Israel. Their mission is to build efficient, high performance, scale-out storage infrastructure. They currently focus on web-scale cloud and service providers and financial services.
NVMesh
Let’s dive right in: NVMesh is software-defined, scale-out block storage using commodity servers and NVMe drives. There are many software-defined storage systems out there, but according to quite a few Storage Field Day delegates, this might be the truest one. The reason is that the control and data planes are completely separated, and the NVMesh software is not in the data path itself.
The key components of NVMesh are:
- Centralized Management. This controls both the clients and the targets, and serves as a single point of contact for all your orchestration tools to assign storage. It has a fancy GUI that is mobile/tablet compatible and a RESTful API.
- The client side injects a driver in your OS and will present the storage to your application. It could be using a local NVMe device, or it could be using a remote one: for the application this will be transparent.
- The target side presents the NVMe drive(s) for local and/or remote access. As mentioned, the NVMesh software is not in the data path. Instead, an IO travels from client to target through the RDMA NICs and, using the Excelero RDDA protocol, circumvents the CPU and hits the NVMe drive directly. The target module is strictly there to set up the connections to the NVMe devices, and afterwards no longer plays a role in the data path.
All data services are located on the client side of NVMesh. As a result, your target will not be hindered by operations initiated by the client. According to Yaniv Romem, this is a fundamental difference compared with other SDS solutions, and means that you can continue to run applications on your NVMesh target in a converged manner.
The NVMesh software is currently Linux only, and the cluster grows in a RAIN (Redundant Array of Independent Nodes) scale-out fashion. Of course you can add multiple NVMe devices to a single server as well, ticking the scale-up box. Since the CPU is no longer a bottleneck, this is a viable option to get some more bang for your buck and saturate those RDMA NICs (Mellanox and QLogic are supported). You could also opt to use some slow, low-core-count CPUs in the target nodes, but keep an eye on the number of PCIe lanes available to the NVMe devices and the RDMA NIC.
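To make the PCIe lane remark a bit more concrete, here is a rough sketch of the arithmetic. It assumes PCIe 3.0 (8 GT/s per lane with 128b/130b encoding) and a 100GbE RDMA NIC; neither assumption comes from Excelero, and the numbers ignore protocol overhead, so treat it as a ballpark rather than a sizing guide. The function names are made up for illustration.

```python
import math

# Assumption: PCIe 3.0, 8 GT/s per lane, 128b/130b line encoding.
# Usable bandwidth per lane in GB/s (before protocol overhead).
PCIE3_GBS_PER_LANE = 8 * (128 / 130) / 8


def slot_bandwidth_gbs(lanes: int) -> float:
    """Approximate one-direction bandwidth of a PCIe 3.0 slot in GB/s."""
    return lanes * PCIE3_GBS_PER_LANE


def nic_line_rate_gbs(gbit_per_s: float) -> float:
    """Raw Ethernet line rate converted from Gbit/s to GB/s."""
    return gbit_per_s / 8


def min_lanes_for(gbs: float) -> int:
    """Smallest number of PCIe 3.0 lanes that can carry the given GB/s."""
    return math.ceil(gbs / PCIE3_GBS_PER_LANE)


nic_gbs = nic_line_rate_gbs(100)  # assumed 100GbE NIC
print(f"100GbE line rate: {nic_gbs:.2f} GB/s")
print(f"x8 slot:  {slot_bandwidth_gbs(8):.2f} GB/s")
print(f"x16 slot: {slot_bandwidth_gbs(16):.2f} GB/s")
print(f"Lanes needed for line rate: {min_lanes_for(nic_gbs)}")
```

Under these assumptions an x8 slot cannot feed a 100GbE NIC at line rate, while an x16 slot can, which is why lane count matters more than core count on a target node.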
NVMesh can run in a converged, disaggregated or mixed fashion, combining or separating the client and target modules where needed.
The “4.5 million IOps @ 200µs” demo
During the presentation, Excelero demonstrated their software. The demo environment consisted of 4 converged nodes on a Supermicro 2U TwinPro server, and 1 disaggregated node on a Supermicro 2028 server. Connectivity was supplied by a Dell 32-port 100GbE switch. The servers were filled mostly with Intel DC P3600 400GB NVMe drives (rated for 320K 4K read IOps each) and a Samsung PM953 480GB NVMe drive (because they can). For CPUs, they used some not-so-spectacular Intel processors (2x E5-2620 v4 8-core @ 2.1GHz for the converged nodes, and 2x E5-2630 v4 10-core @ 2.2GHz for the disaggregated node), plus 64GB RAM for each server. This results in the following demo environment:
Using this environment, Josh Goldenhar created a couple of volumes and ran IO against them, both locally and remotely. The key takeaways are:
- Installing a new client & target is a 30s job, using YUM packages.
- Using policies, you’re able to make a distinction between converged and disaggregated NVMesh nodes. You could extend this policy to further specify which nodes to pick, e.g. add a datacenter or row tag.
- 89µs local IO response time for a 4K read. 95µs if you accessed the volume remotely over the RDMA network. This means a 6µs network overhead.
- Both local and remote systems are able to saturate a single 320k IOps NVMe device, at roughly 300µs.
- When accessing a volume using 16 out of 24 Intel NVMe drives in big24, Excelero NVMesh was able to achieve a phenomenal 4.5 million 4K read IOps, at roughly 200µs.
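A quick back-of-the-envelope check on those numbers, as a sketch. The constants are the figures quoted above; everything derived from them is plain arithmetic. The Little's law step (IOs in flight = throughput × latency) is an added sanity check, not something shown in the demo.

```python
# Figures quoted from the demo.
LOCAL_READ_US = 89            # 4K read latency, local volume
REMOTE_READ_US = 95           # 4K read latency, remote over RDMA
DRIVE_RATED_IOPS = 320_000    # Intel DC P3600 rated 4K read IOps
DRIVES_IN_VOLUME = 16         # drives backing the big volume
MEASURED_IOPS = 4_500_000     # 4K read IOps achieved
MEASURED_LATENCY_US = 200     # approximate latency at that load

# Extra latency added by going over the network.
network_overhead_us = REMOTE_READ_US - LOCAL_READ_US

# What 16 drives could deliver if each hit its rated number.
rated_aggregate_iops = DRIVE_RATED_IOPS * DRIVES_IN_VOLUME

# Fraction of the rated aggregate that the demo actually reached.
efficiency = MEASURED_IOPS / rated_aggregate_iops

# Little's law: average IOs in flight = throughput * latency.
outstanding_ios = MEASURED_IOPS * MEASURED_LATENCY_US / 1_000_000

print(f"Network overhead: {network_overhead_us} µs")
print(f"Rated aggregate:  {rated_aggregate_iops:,} IOps")
print(f"Efficiency:       {efficiency:.1%}")
print(f"IOs in flight:    {outstanding_ios:.0f}")
```

In other words, the 4.5 million IOps result is close to 88% of what sixteen of these drives are rated for on paper, with roughly 900 IOs in flight across the cluster at any moment.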
All at low cost
The kicker? The big24 system, with its cheap CPUs and 24 NVMe devices, costs $13,000. NVMesh is licensed per NVMe drive, which enables you to easily upgrade your devices in the future. There are different licensing prices for the target and client nodes; if you go converged, it’s a combined price. And the licensing cost per device is low…
My thoughts on Excelero NVMesh
I’m really impressed with the performance, mature GUI and clear messaging from Excelero, especially since this is a relatively young company and a v1.1 product. There’s very little marketing fluff around these guys, and a strong “no BS, let’s get it done” vibe.
Earlier in this post I wrote about the data services being located at the client side. So far, there aren’t many of those data services yet: there’s no deduplication, thin provisioning, etc. That said: Excelero just came out of stealth and is still focusing on raw performance on commodity hardware, so that’s easy to forgive. “Walk before you can run”, or in Excelero’s case: “run before you can fly”.
I’m looking forward to hearing more from Excelero in the near future, maybe even at a Storage Field Day in Israel. There are a lot of storage startups coming from that area of the world. The roadmap contains exciting new things such as non-Linux OS support, QoS and advanced data services. In the meantime, I’ll have to find a customer or environment that actually needs this many IOps…
Many of the delegates were very impressed with NVMesh. Read Ray’s very positive post over here, and Chan wrote a short novel which I can highly recommend if you want to dive into the protocols and technology. You can watch the Excelero presentations at the Storage Field Day 12 website.
Disclaimer: I wouldn’t have been able to attend Storage Field Day 12 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.