It is a fact of IT life: hardware becomes faster and more powerful with every new generation on the market. That absolutely applies to CPUs. A few weeks ago at Intel's Data-Centric Innovation Day in San Francisco, Intel presented its new Xeon Scalable processors. These beasts now scale up to 56 cores per socket, with up to 8 sockets per system/motherboard. This incredible amount of compute power enables applications to "do things", whether it's analytics, machine learning, or running cloud applications.
All applications have one thing in common: they don't want to wait for data. As soon as your %iowait starts climbing, you are wasting precious and expensive compute power because the storage subsystem cannot keep up. Fortunately, WekaIO wants to make sure this won't be the case for your applications.
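If you want to see whether this is happening on your own systems, a quick check with psutil will tell you. A minimal sketch for a Linux host; the sampling interval and loop count are arbitrary:

```python
# Minimal sketch: watch %iowait on a Linux host with psutil.
# If iowait climbs while throughput stalls, your cores are waiting
# on storage instead of doing useful work.
import psutil

def sample_iowait(interval: float = 1.0) -> float:
    """Return the percentage of time CPUs spent waiting on I/O."""
    times = psutil.cpu_times_percent(interval=interval)
    return getattr(times, "iowait", 0.0)  # iowait only exists on Linux

if __name__ == "__main__":
    for _ in range(5):
        print(f"%iowait: {sample_iowait():.1f}")
```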
WekaIO and cloud native, technical compute workloads
WekaIO offers, in their words, the fastest and most scalable parallel file system for AI and technical compute workloads. Where you run these workloads (on-premises or in the cloud) doesn’t matter. Some customers of WekaIO run them in hybrid models too.
The WekaIO file system is a fully coherent POSIX file system that, according to WekaIO, is even faster than a local file system. It uses distributed coding, becomes more resilient at scale, and features fast rebuilds with end-to-end data protection. It resembles erasure coding, but without the performance penalty.
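To get a feel for the general idea (this is not WekaIO's actual distributed coding scheme, just a toy single-parity illustration), losing a block doesn't lose data, because the survivors can rebuild it:

```python
# Toy illustration of parity-based protection (NOT WekaIO's actual
# distributed coding scheme): with a single XOR parity block, any one
# lost data block can be rebuilt from the surviving blocks.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]     # data blocks on three nodes
parity = xor_blocks(data)              # parity block on a fourth node

# The node holding data[1] fails; rebuild its block from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```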
Moreover, it can be deployed either as a converged appliance (WekaIO file system plus applications on shared infrastructure) or as a dedicated filesystem. The software runs in a container, and you decide which model it will be. Both models are fully cloud native, and for good reason. As Liran Zvibel explains, you can carve off a piece of a workload, push it to the cloud, and compute it over there. Then you snapshot the results and push that snapshot back to the on-premises system. This leverages the elasticity of the public cloud to quickly process a work set. Locking files across both locations is not possible, but in this batch-based approach, it doesn't have to be.
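The shape of that workflow is roughly this. A hedged sketch, where every function is a hypothetical placeholder rather than WekaIO's actual API; what matters is the pattern of immutable snapshots moving between sites:

```python
# Hedged sketch of the batch-style burst-to-cloud flow described above.
# All names are hypothetical placeholders, not WekaIO's actual API.

def take_snapshot(fs_path: str) -> str:
    print(f"snapshot of {fs_path} taken")
    return f"{fs_path}@snap-001"

def push_snapshot(snap: str, target: str) -> None:
    print(f"pushing {snap} to {target}")

def process_in_cloud(snap: str) -> str:
    print(f"processing {snap} on an elastic cloud cluster")
    return snap.replace("snap", "result")

def pull_snapshot(snap: str, target: str) -> None:
    print(f"pulling {snap} back to {target}")

# Freeze the working set, burst it out, and bring only results home.
# No cross-site file locking is needed: each side works on an
# immutable snapshot, so there is nothing to coordinate live.
snap = take_snapshot("/mnt/weka/trainset")
push_snapshot(snap, "s3://burst-bucket")
result = process_in_cloud(snap)
pull_snapshot(result, "/mnt/weka/trainset")
```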
Front-end protocols are S3 object storage, SMB, and NFS. The back-end network is either InfiniBand or Ethernet; preferably 100 Gbps, but you can start with 10 Gbps.
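Because the front end speaks S3, any stock S3 client should be able to talk to it. A sketch with boto3, where the endpoint URL, credentials, and bucket name are hypothetical placeholders for your own deployment:

```python
# Sketch: talk to an S3-compatible front end with a standard client.
# Endpoint, credentials, and bucket are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://weka-frontend.example.com:9000",  # hypothetical
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello weka")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```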
Artificial Intelligence and Machine Learning
WekaIO aims for the performance use cases. Its main vertical is AI/ML, with genomics and financials as secondary markets. So why would a customer choose WekaIO?
- All performance is available on a single mount point
- Existing NAS systems cannot deliver enough performance
- Low latency is of the utmost importance
- Metadata performance is critical, and your current system isn’t delivering
- Capacity needs to scale along with performance
- And the economics of the previously discussed hybrid cloud model make sense for your workloads
From an operational perspective…
The product demo immediately showed us a very fancy HTML5 GUI, as well as a CLI. Scalability-wise, the WekaIO file system runs on a minimum of six nodes. Theoretically, such a cluster can grow to 64,000 (!) nodes, with real-life clusters of 1,000 nodes already in use.
Most users start with a small, minimum-size cluster and add nodes on the fly. Expansion can even be automated, triggered when predetermined limits are reached. Data is rebalanced automatically: metadata and CPU load move immediately, while data is rebalanced lazily.
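Conceptually, such an automated expansion is just a control loop. A minimal sketch, where the threshold and the add_node() call are hypothetical stand-ins for whatever the cluster API or your cloud auto-scaling offers:

```python
# Minimal sketch of automated expansion on a predetermined limit.
# Threshold and add_node() are hypothetical stand-ins, not WekaIO's API.
CAPACITY_LIMIT = 0.80   # expand when the cluster is 80% full

def used_fraction() -> float:
    return 0.85  # placeholder: query the cluster's real capacity here

def add_node() -> None:
    print("adding a node; rebalancing happens automatically")

if used_fraction() > CAPACITY_LIMIT:
    add_node()
```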
In the cloud, everything costs money. That's why you can spin a cluster up and down on demand, to conserve cloud resources and lower costs. Spinning up a WekaIO cluster takes roughly 15 minutes. There's even a configurator, which generates a JSON template that you can push into your AWS account.
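Assuming the generated JSON is a CloudFormation template (my assumption, not something confirmed in the presentation), pushing it into your AWS account could look like this with boto3; the stack name and file path are hypothetical:

```python
# Sketch: push a configurator-generated template into AWS with boto3,
# assuming it is a CloudFormation template. Stack name and file path
# are hypothetical.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
with open("weka-cluster.json") as f:
    template = f.read()

cfn.create_stack(
    StackName="weka-poc",
    TemplateBody=template,
    Capabilities=["CAPABILITY_IAM"],  # such templates often create IAM roles
)
```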
Performance and tiering
During the demo, WekaIO showed us some incredible performance numbers for single-client random reads and random writes. While failing one and even two nodes, performance stayed the same, albeit after a brief interruption of the workload. The GUI is very helpful during these failures, clearly showing the failure and its impact on data protection. Dual-protection data is degraded to single protection and, on a second failure, to no protection. After the second node failure, the cluster works hardest to bring the data that lost all protection back to a single-protection state. Once that's done, it starts restoring dual protection. Restoring data protection in that order sounds logical, but I am not sure other systems out there do it like that, or show it.
These @WekaIO performance benchmarks are blowing our mind. 11.6GB/s random read for a single node, saturating a 100Gb port. 9.8GB/s random write. Even when degraded (1 or 2 nodes down), perf is great! #SFD18
— Jon Klaus (@JonKlaus) February 27, 2019
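That rebuild ordering is easy to express as a priority: fix the most exposed data first. A toy sketch of the idea; the names and data structures are mine, not WekaIO's implementation:

```python
# Toy sketch of the rebuild ordering described above: after a double
# failure, stripes with no remaining protection are rebuilt before
# stripes that still have single protection. Illustrative only.
stripes = [
    {"id": 1, "copies_lost": 2},  # no protection left: most urgent
    {"id": 2, "copies_lost": 1},  # degraded to single protection
    {"id": 3, "copies_lost": 0},  # still fully protected
]

# Rebuild the most exposed stripes first.
for stripe in sorted(stripes, key=lambda s: s["copies_lost"], reverse=True):
    if stripe["copies_lost"] > 0:
        print(f"rebuilding stripe {stripe['id']} "
              f"({stripe['copies_lost']} protection level(s) lost)")
```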
Should you have any data on the WekaIO file system that does not need the highest performance, you can tier it to an object store. The data in this object store remains accessible through the global namespace; it simply spans two tiers. WekaIO qualifies the S3 partners/buckets; AWS, Cloudian, and Scality are already approved. For this, WekaIO only needs a bucket in the object store, not the entire object store.
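As a toy illustration of the tiering concept, an age-based demotion policy could look like the sketch below; the threshold, paths, and policy are illustrative, not WekaIO's actual tiering engine:

```python
# Toy sketch of age-based tiering to an object store: files that have
# gone cold are candidates for the bucket tier, while staying visible
# in one namespace. Not WekaIO's actual policy engine.
import os
import time

COLD_AFTER = 30 * 24 * 3600  # demote after 30 days without access

def demote_candidates(root: str):
    now = time.time()
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime > COLD_AFTER:
                yield path  # these would move to the S3 tier

for path in demote_candidates("/mnt/weka"):
    print(f"tier to bucket: {path}")
```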
WekaIO code base
Design-wise, all data uses the same single, network-based data path. If the data is on another node, the request goes out over the network to that node. If the data is on the same node, it still goes "over the network", just to the local loopback address. This keeps the code base clean and avoids writing two or more code paths for every potential location of the data.
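Here's a toy illustration of that single data path: the client code is identical whether the block lives on a remote node or locally, because a "local" read simply dials the loopback address. (Illustrative only, not WekaIO's wire protocol.)

```python
# Toy illustration of a single network-based data path: one read_block()
# code path for every data location; "local" just means 127.0.0.1.
import socket
import threading

srv = socket.create_server(("127.0.0.1", 9999))  # a local "storage node"

def serve_one():
    conn, _ = srv.accept()
    conn.sendall(b"block-data:" + conn.recv(64))
    conn.close()

threading.Thread(target=serve_one, daemon=True).start()

def read_block(node: str, port: int, block_id: bytes) -> bytes:
    # Always go over the network, even when the data happens to be local.
    with socket.create_connection((node, port)) as s:
        s.sendall(block_id)
        return s.recv(128)

# The "remote" node here happens to be the local loopback address.
print(read_block("127.0.0.1", 9999, b"block-42"))
```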
One of the features on the roadmap is QoS, although it is currently not a high-priority item. Some service providers are asking for QoS, but the performance limits are not being reached, so nobody is in pain right now. Over time, the WekaIO team will add QoS and other features as the product evolves.
My thoughts on the WekaIO file system
The unique selling point for WekaIO is performance. High performance. And this was apparent during the demo! Apart from crunching through millions of IOPS and hundreds of Gbit/s in demos, the product feels and looks solid. The modern, intuitive GUI and small touches like pre-generated JSON templates to deploy WekaIO in AWS are a godsend for operational folks.
The WekaIO team makes some bold claims (fastest and most scalable parallel file system), but they sound and feel grounded in solid technology. The presenting team didn't shroud product details in mist and vague claims, but was very open about everything. I am very curious to see how WekaIO will evolve and would love to give it a spin. I might have to nudge myself into an environment that needs hundreds of Gbps of throughput though!
Check out the presentation on the Storage Field Day 18 event page. My fellow delegates were also much faster in writing up their posts: check out these posts from Chin-Fah, Chris and Dan to name a few of them.
Disclaimer: I wouldn’t have been able to attend Storage Field Day 18 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.