Once upon a time there was a data center filled with racks of physical servers. Thanks to hypervisors such as VMware ESX, it was possible to virtualize these systems and run them as virtual machines, using less hardware. This brought a lot of advantages in terms of compute efficiency, ease of management, and deployment/DR agility.
To enable many of the hypervisor features such as vMotion, HA and DRS, the virtual machine data had to live on a shared storage system. This had an extra benefit: it’s easier to hand out pieces of one big pool of shared storage than to predict capacity requirements for hundreds of individual servers. Some servers might need a lot of capacity (file servers), while others might need just enough for an OS and maybe a web server application. So the move to centralized storage was also beneficial from a capacity allocation perspective.
So what happens when you run a whole bunch of virtual machines on centralized storage? A lot of I/O is sent to that central storage system. Since it’s no longer a few sequential streams but many hundreds of mixed streams, the I/O hitting the back-end is often completely random, and latency suffers as a result.
Flash media to the rescue!
Flash solved many of these problems, and created a new one of its own: the controllers in central storage systems were unable to keep up with the speed of the flash media. Back-end media latency dropped significantly and throughput went up. And as usual, when you remove one bottleneck the next one pops up: connectivity. Both FC and Ethernet add latency (a few ms) and had limited throughput (4-8Gbit FC and 10GbE back then).
So what if you want even lower latencies and higher throughput? You put the super-fast storage inside the server, for example as a Fusion-io accelerator card or an NVMe flash device. Both plug straight into the PCIe bus of the server and the performance is staggering: throughput jumps to hundreds of thousands of IOPS, several GB/s of bandwidth, and latency in the microseconds. This is so effective that Wikibon expects a massive shift from traditional enterprise storage systems to server SAN based solutions. Coincidentally, this matches EMC’s prediction that by 2020 all production applications will be flash-based.
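To put those numbers in perspective, here’s a quick back-of-the-envelope calculation. The IOPS figures and block sizes below are made-up but representative examples, not measurements from any specific device:

```python
# Bandwidth is simply IOPS times the I/O size, so a hypothetical local NVMe
# device doing 500,000 IOPS at a 4 KiB block size already moves ~2 GB/s.
def bandwidth_gbps(iops: int, block_size_kib: int) -> float:
    """Return throughput in GB/s for a given IOPS rate and block size."""
    return iops * block_size_kib * 1024 / 1e9

print(bandwidth_gbps(500_000, 4))    # ~2.0 GB/s at small blocks
print(bandwidth_gbps(25_000, 128))   # ~3.3 GB/s at large blocks
```

In other words, at small block sizes you need hundreds of thousands of IOPS before "several GB/s" is even on the table.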
One problem: what if you don’t use the full potential of that insanely fast NVMe device in the server? We’re back at the original situation, with hundreds of small silos of data scattered across the data center, none of them used efficiently…
SPDK and NVMe over Fabrics (NVMe-oF)
Thanks to fast networks such as 100GbE and InfiniBand, we can now efficiently share those NVMe devices with other servers. This is NVMe-oF, or NVMe over (RDMA) Fabrics. Don’t be misled by the “Fabrics” part: it’s not just Fibre Channel that can carry NVMe-oF. In fact, Ethernet will probably take the lead, thanks to the significantly higher speeds available.
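As a rough illustration of what this looks like in practice (not something shown at SFD12), here’s a minimal sketch of exporting a local NVMe SSD over RDMA with the SPDK NVMe-oF target, driven through its JSON-RPC script. The PCIe address, IP address, port and NQN are placeholders, and the RPC method names follow recent SPDK documentation, so double-check them against the release you’re actually running:

```python
# Hypothetical sketch: share a local NVMe device over an RDMA fabric using the
# SPDK NVMe-oF target. Assumes the SPDK nvmf_tgt app is already running and
# that scripts/rpc.py from an SPDK checkout is on the path below.
import subprocess

RPC = "scripts/rpc.py"  # path inside an SPDK checkout

def rpc(*args):
    """Run one SPDK JSON-RPC call and fail loudly if it errors."""
    subprocess.run([RPC, *args], check=True)

# 1. Create the RDMA transport inside the target.
rpc("nvmf_create_transport", "-t", "RDMA")

# 2. Claim the local NVMe SSD (placeholder PCIe address) as an SPDK bdev.
rpc("bdev_nvme_attach_controller", "-b", "NVMe0", "-t", "PCIe", "-a", "0000:06:00.0")

# 3. Create a subsystem, add the namespace, and listen on the RDMA NIC.
nqn = "nqn.2016-06.io.spdk:cnode1"
rpc("nvmf_create_subsystem", nqn, "-a", "-s", "SPDK00000000000001")
rpc("nvmf_subsystem_add_ns", nqn, "NVMe0n1")
rpc("nvmf_subsystem_add_listener", nqn, "-t", "rdma", "-a", "192.168.100.8", "-s", "4420")
```

An initiator could then connect to that subsystem with the regular Linux nvme-rdma driver or with SPDK’s own user-space initiator.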
At Storage Field Day 12 we visited Intel and talked about their SPDK software stack. Let’s walk through the NVMe-oF latency model in the picture above:
- A read I/O is generated by the initiator at START and traverses the network to the target (the grey block). This takes approximately 6.5µs.
- The Intel SPDK stack adds 0.2µs on the incoming read.
- The NVMe controller currently takes roughly 80µs to fetch the data from the flash media.
- The data traverses the SPDK stack again, adding 0.1µs.
- The last hop back over the network adds another 7µs.
So what does this mean? Basically: it’s entirely possible to share your lightning-fast NVMe device over the network. Yes, you’ll pay a small latency penalty, but it’s in the region of ~15µs per I/O, which isn’t that bad. Even better: new flash media with even lower latencies are around the corner. Intel Optane SSDs using 3D XPoint technology are rumoured to deliver media latencies in the range of ~10µs, which means end-to-end latency will drop even further.
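If you want to play with the numbers yourself, here’s a tiny sketch that adds up the individual hops from the list above (the values are the ballpark figures quoted in the presentation):

```python
# Latency budget for one remote NVMe-oF read, all values in microseconds.
hops = {
    "network, initiator -> target":  6.5,
    "SPDK target, inbound":          0.2,
    "NVMe controller + flash media": 80.0,
    "SPDK target, outbound":         0.1,
    "network, target -> initiator":  7.0,
}

total = sum(hops.values())
fabric_penalty = total - hops["NVMe controller + flash media"]

print(f"end-to-end read latency: {total:.1f} µs")        # ~93.8 µs
print(f"remote-access penalty  : {fabric_penalty:.1f} µs")  # ~13.8 µs, roughly the ~15µs above

# Swap in a ~10µs Optane-class medium and the fabric overhead starts to dominate:
hops["NVMe controller + flash media"] = 10.0
print(f"with 3D XPoint-class media: {sum(hops.values()):.1f} µs end-to-end")
```

Drop the media latency to Optane-class numbers and the network plus software overhead suddenly becomes the biggest chunk of the total, which is exactly why the stack needs to be this lean.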
The SPDK stack has an additional advantage: efficiency.
Compared to the standard Linux kernel approach, SPDK offers a 10x CPU efficiency gain. Where the kernel drivers would saturate 30 CPU cores to push 3.5M IOPS across three saturated 50GbE links, the SPDK stack needs only 3 CPU cores. This is great news for vendors deploying HCI solutions, as it leaves more CPU cores available to run virtual machines or containers.
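The same claim as back-of-the-envelope arithmetic, using only the numbers mentioned above:

```python
# 3.5M IOPS over three 50GbE links, handled by 30 cores (kernel) vs 3 cores (SPDK).
iops = 3_500_000
kernel_cores = 30
spdk_cores = 3

print(f"kernel stack   : {iops / kernel_cores:,.0f} IOPS per core")  # ~117,000
print(f"SPDK           : {iops / spdk_cores:,.0f} IOPS per core")    # ~1,167,000
print(f"efficiency gain: {kernel_cores / spdk_cores:.0f}x")          # 10x
```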
My thoughts on NVMe-oF and Intel SPDK
I’ve mentioned before that hardware is speeding up and that software has to catch up to bring overall latencies down. With Intel SPDK, sharing an NVMe device across an RDMA fabric will be more accessible than ever. Both CPU overhead and controller/software latency go down, reducing the penalty for accessing a remote NVMe device. This paves the way for server SAN and HCI architectures. In the end I think this will help accelerate the adoption of NVMe devices and of new flash media types as well, bringing even lower latencies and higher storage performance to applications. Or in the words of Jonathan Stern from Intel: “It’s raining IOPS, hallelujah!”
I can highly recommend watching the Intel Storage Field Day 12 videos, which you can find over here. Apart from the video with Jonathan, there’s also a video where Tony Luck presents some of the low level processor instructions used by Intel’s Resource Director Technology. As usual, we bottomed out during this deep dive, so let me know what you think! Chan and Glenn have also blogged about their experiences at Intel. And last but not least: check out the SPDK website and maybe download the libraries for your own server SAN/HCI solution.
Disclaimer: I wouldn’t have been able to attend Storage Field Day 12 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.