Performance Considerations
10 Gigabit Ethernet Configurations
By default, the SoftNAS VM (on VMware) ships with the E1000 virtual NIC adapter, and VMware defaults to MTU 1500.
For best performance above 1 gigabit, follow the steps below.
1. Replace the E1000 virtual NIC adapter on the SoftNAS VM with a vmxnet3 adapter.
2. Use MTU 9000 instead of MTU 1500 for the vSwitch, VMkernel and physical switch configurations. Be sure to also configure the network interface in SoftNAS for MTU 9000; a command sketch follows below.
Refer to the MTU 9000 section for more information.
A dedicated VLAN for storage traffic is recommended.
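As a rough illustration of step 2 (a sketch only, not an exact procedure for every environment; vSwitch1, vmk1 and eth0 are placeholder names to substitute with your own), jumbo frames might be enabled on the ESXi host and inside the SoftNAS VM along these lines:

    # On the ESXi host: set MTU 9000 on the storage vSwitch and VMkernel port
    esxcli network vswitch standard set -v vSwitch1 -m 9000
    esxcli network ip interface set -i vmk1 -m 9000

    # Inside the SoftNAS VM: set MTU 9000 on the storage interface
    ip link set dev eth0 mtu 9000

    # Verify the setting took effect
    ip link show dev eth0 | grep mtu

The physical switch ports carrying the storage VLAN must also be configured for jumbo frames, or the larger frames will be dropped somewhere in the path.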
iSCSI Multi-pathing
To increase throughput and resiliency, use of iSCSI multipathing is recommended by VMware and other vendors.
Since SoftNAS operates in a hypervisor environment, it is possible to configure multi-path operation as follows:
1. On the VMware (or Hyper-V) host where the SoftNAS VM runs, install and use multiple physical NIC adapters.
2. Assign a dedicated vSwitch for each incoming iSCSI target path (one per physical NIC).
3. Assign the SoftNAS VM a dedicated virtual NIC adapter for each incoming iSCSI target path (per vSwitch/physical NIC).
4. Assign a unique IP address to each corresponding Linux network interface (for each virtual NIC attached to the SoftNAS VM).
5. Restart the SoftNAS iSCSI service and verify connectivity from the iSCSI initiator client(s) to each iSCSI target path, as sketched below.
A dedicated VLAN for storage traffic is recommended.
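As an illustration of steps 4 and 5, verification from a Linux iSCSI initiator client might look roughly like the following (the IP addresses are placeholders for the two SoftNAS target paths, and the exact target name depends on your SoftNAS configuration):

    # Discover the iSCSI target over each path
    iscsiadm -m discovery -t sendtargets -p 10.10.1.10
    iscsiadm -m discovery -t sendtargets -p 10.10.2.10

    # Log in to the discovered target(s) over both paths
    iscsiadm -m node --login

    # Confirm two active sessions (one per path)
    iscsiadm -m session

    # With device-mapper multipath installed, confirm both paths
    # are grouped under a single multipath device
    multipath -ll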
Other Performance Considerations
As with any storage system, NAS performance is a function of many different factors working in combination:
- Disk drive speed and the chosen RAID configuration
- Cache memory (first-level read cache or ARC)
- Second-level cache (e.g., L2ARC) speed
- Disk controller and protocol
- Available network bandwidth; e.g., 1 GbE vs. 10 GbE vs. InfiniBand
- Network QoS (whether the network is dedicated, shared, local vs. remote, EC2 provisioned IOPS, etc.)
- Network latency (between workload VMs and the SoftNAS VM)
- MTU settings in VM host software and switches
- Thin-provisioning vs. thick-provisioning
- Available CPU (especially when compression is enabled)
- Network access protocol (NFS, CIFS/SMB, iSCSI, direct-attached Fibre Channel)
- Use of VLANs to separate storage traffic from other network traffic
The tradeoffs between cost and performance can be significant, so when designing your NAS solution it is important to understand your actual initial performance needs, along with contingency plans for addressing growth in those needs over time.
Virtual Devices and IOPS - Because SoftNAS is built atop ZFS, IOPS (I/Os per second) are mostly a function of the number of virtual devices (vdevs) in a zpool, not of the raw number of disks in the zpool. This is probably the single most important point to understand, and it is commonly missed. A vdev ("virtual device") is a single device or partition that acts as a source of storage on which a pool can be created. For example, in VMware, each vdev can be a VMDK or raw disk device assigned to the SoftNAS VM.
A multi-device or multi-partition vdev can be in one of the following shapes:
- Stripe (technically, each chunk of a stripe is its own vdev)
- Mirror
- RaidZ
- A dynamic stripe of multiple mirror and/or RaidZ child vdevs
ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS-bound to the speed of the slowest disk within it, so if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively that of a single disk, not 100. There are a couple of caveats here (such as the difference between write and read IOPS), but as a rule of thumb, assume a zpool's raw IOPS potential is equivalent to that of the single slowest disk in each vdev in the zpool and you won't end up surprised or disappointed.
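As a sketch of the difference (pool and device names here are placeholders, not a recommended layout), compare a pool built as one large RaidZ vdev with a pool built from several mirror vdevs:

    # One vdev: a single 8-disk RaidZ2 - raw IOPS potential of roughly one disk
    zpool create tank1 raidz2 sdb sdc sdd sde sdf sdg sdh sdi

    # Four vdevs: a dynamic stripe of four mirrors - roughly four disks' worth of IOPS
    zpool create tank2 mirror sdb sdc mirror sdd sde mirror sdf sdg mirror sdh sdi

    # Inspect the resulting vdev layout
    zpool status tank2

Both pools use the same eight disks, but the second layout spreads writes across four vdevs instead of one, at the cost of lower usable capacity.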
Of course, if you are using hardware RAID which presents a unified datastore to VMware (or Hyper-V), then the actual striping of writes occurs in your RAID controller card. Just be aware of where striping occurs and its implications for performance (especially write throughput).
Block size, Windows and VMware VMDK workloads - VMware uses 4K block reads and writes. If you have a high-performance VMware use case, be sure to deploy an adequate amount (e.g., 64 GB or more) of write log (ZFS "ZIL") and RAM, plus read cache (ZFS "L2ARC"), to absorb the high level of 4K block I/O for best results. If you have workloads with predominantly small (less than 128K) reads and writes, making use of RAM, write log and read cache is critical to achieving maximum throughput, since ZFS block I/O occurs in 128K chunks. Windows also defaults to 4K blocks.
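For example (device names are placeholders; sizes and devices should match your own environment), dedicated write-log and read-cache devices might be attached to an existing pool roughly as follows:

    # Add a mirrored pair of SSDs as the ZFS intent log (write log)
    zpool add tank log mirror sdj sdk

    # Add an SSD as second-level read cache (L2ARC)
    zpool add tank cache sdl

    # Confirm the log and cache devices are attached
    zpool status tank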
Fortunately, the cost of high-speed media continues to drop, with SSD drives eclipsing high-speed spindles (e.g., 15K SAS) in both performance and cost. Memory has also become very affordable, so deploying 64 GB to as much as 1 TB of bus-speed memory is a great way to accelerate your NAS's performance out of the starting gate. Use of SSD for read cache and write logs can also greatly speed up performance, even when front-ending slower SATA mass storage, for many use cases.
7,200 RPM drives are designed for single-user, sequential access, not multi-user, virtualized workloads. Use of 10K or 15K SAS drives in a RAID 10 or RAID 6 configuration is recommended as a starting point.
But fast NAS response to requests isn't the only factor governing how well your workloads perform. Network design, available bandwidth and latency are also important. For example, for high-performance NAS applications, use of a dedicated VLAN for storage is a must wherever possible. Configuring all components in the storage path for MTU 9000 will greatly increase throughput by reducing the effects of round-trip network latency and reducing the interrupt load on the NAS server itself. Interrupts are often overlooked as a source of overhead because they aren't readily measured, but their effects can be significant, both on the NAS server and on workload servers. Make sure you configure any NAS from which you need the highest level of performance for MTU 9000, along with the switch ports used between the NAS host and the workload servers.
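One simple way to confirm that jumbo frames actually pass end to end (the NAS IP address below is a placeholder) is to send a maximum-size, non-fragmentable ping from a workload host to the NAS:

    # From a Linux workload host: 8972-byte payload + 28 bytes of headers = 9000-byte frame
    ping -M do -s 8972 -c 4 10.10.1.10

    # From an ESXi host, the equivalent test uses vmkping
    vmkping -d -s 8972 10.10.1.10

If the ping fails with a "message too long" or fragmentation error, some component in the path (vSwitch, VMkernel port, physical switch or NIC) is still at MTU 1500.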
A single 1 GbE network segment will produce at most about 120 MB/sec of throughput under the most ideal conditions. 10 GbE has been observed to deliver up to 1,000 MB/sec of throughput.
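To see what your network path actually delivers, as opposed to these theoretical figures, a raw TCP throughput test can be run between a workload host and a host on the storage network with a tool such as iperf3 (assuming it is installed on both ends; the IP address is a placeholder):

    # On the receiving side (a host on the storage VLAN), start a listener
    iperf3 -s

    # On the workload host, run a 30-second test against it
    iperf3 -c 10.10.1.10 -t 30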
The next consideration is protocol - should you use NFS, CIFS or iSCSI? iSCSI often provides the best throughput, along with increased resiliency through multi-pathing. Just be aware of the added complexity associated with iSCSI.
For VM-based workloads, it's hard to go wrong with NFS or iSCSI. For user data (e.g., file shares), CIFS is more common because of the need to integrate natively with Windows, domain controllers and Active Directory when using a NAS as a file server.
Thick-provisioning VMware datastores provides increased write performance and should be preferred over thin-provisioning of VMDKs when optimal performance is required.
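If an existing VMDK was created thin, it can typically be converted ("inflated") to eagerzeroedthick from the ESXi command line; the datastore path below is a placeholder, and the VM should be powered off before converting:

    # Inflate a thin-provisioned VMDK to eagerzeroedthick (VM powered off)
    vmkfstools --inflatedisk /vmfs/volumes/datastore1/myvm/myvm.vmdk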
Whatever design you come up with, it's important to verify your implementation by running performance benchmarks to validate that you are actually seeing the expected throughput (before you go into production).
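As one example of such a benchmark (fio is assumed to be installed on a test client; the mount point and sizes are placeholders), sequential and random tests against an NFS-mounted SoftNAS volume might look like this:

    # Sequential write throughput, 1M blocks, direct I/O
    fio --name=seqwrite --directory=/mnt/softnas --rw=write --bs=1M --size=4G --direct=1 --numjobs=1

    # Random 4K read IOPS, matching typical VMware/Windows block sizes
    fio --name=randread --directory=/mnt/softnas --rw=randread --bs=4k --size=4G --direct=1 --numjobs=4 --group_reporting

Run the same tests against the underlying local disks as a baseline, so you can distinguish network bottlenecks from storage bottlenecks.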
One approach that works well for a broad range of applications is to use a combination of SAS and SATA drives, with SSD for read cache and write log (always configure write logs as mirrored pairs in case a drive fails). SATA drives provide very high densities in a relatively small footprint, which is perfect for user mass storage, Windows profiles, Office files, MS Exchange, etc. SQL Server typically demands SAS and/or SSD for best results, due to the high transaction rates involved. Exchange can be relatively heavy on I/O when it's starting up, but since it reads almost everything into memory, high-speed caching does little to help run-time performance after initial startup.
Virtual desktops benefit greatly from all the cache memory, level 2 caching and high-speed storage you can afford, because many performance lags quickly become visible as users launch applications, open and save files, etc. Caching also helps alleviate the "login storms" and "boot storms" that occur when a large number of simultaneous users attempt to log in first thing in the morning. For these situations, a combination of local caching (on each VDI server) and appropriate caching for user profiles and applications can yield excellent results.
Deduplication Is Not Free - A common misunderstanding is that ZFS deduplication is a free way to gain space savings on your ZFS filesystems, zvols and zpools. Nothing could be further from the truth. ZFS deduplication is performed on the fly as data is read and written, and this can lead to a significant and sometimes unexpectedly high RAM requirement.
Every block of data in a deduplicated filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries consume RAM. It is not uncommon for DDTs to grow larger than the available RAM on zpools that aren't even that large (a couple of TBs). If hits against the DDT aren't serviced primarily from RAM (or fast SSD configured as L2ARC), performance quickly drops to abysmal levels. Because enabling or disabling deduplication within ZFS doesn't actually do anything to data already committed on disk, it is recommended that you do not enable deduplication without a full understanding of its RAM and caching requirements. You will be hard-pressed to get rid of it later, after you have many terabytes of deduplicated data already written to disk and discover you need more RAM and/or cache; i.e., plan your cache and RAM needs around how much total deduplicated data you expect to have.
A general rule of thumb is to provide at least 2 GB of DDT per TB of deduplicated data (actual results will vary based on how much duplication of data you actually have).
Please note that the DDT requires RAM beyond whatever you need for caching data, so be sure to take this into account (RAM is very affordable these days, so get more than you think you may need, to be on the safe side).
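To estimate whether deduplication is worthwhile before enabling it, and to gauge how much memory the DDT would need, ZFS provides a simulation mode (the pool name is a placeholder; the arithmetic below simply applies the rule of thumb above):

    # Simulate deduplication on an existing pool to see the would-be dedup ratio
    # and DDT histogram before ever turning dedup on
    zdb -S tank

    # On a pool that already has dedup enabled, show actual DDT entry counts and sizes
    zpool status -D tank

    # Rough RAM budgeting per the rule of thumb above:
    # 10 TB of deduplicated data x 2 GB per TB = ~20 GB for the DDT alone,
    # in addition to the RAM you want for the ARC (read cache).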
Extremely Large Destroy Operations - When you destroy large filesystems, snapshots and cloned filesystems (e.g., in excess of a terabyte), the data is not immediately deleted - it is scheduled for background deletion processing. The deletion process touches many metadata blocks, and in a heavily deduplicated pool, must also look up and update the DDT to ensure the block reference counts are properly maintained. This results in a significant amount of additional I/O, which can impact the total IOPS available for production workloads.
For best results, schedule large destroy operations for after hours or weekends so those deletion processing IOPS will not impact the IOPS available for normal business day operations.
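Where the installed ZFS version supports background (asynchronous) destroy, the amount of data still queued for deletion can be monitored, which is a convenient way to confirm when a large destroy has finished working through the pool; the pool name is a placeholder:

    # Show how much space is still being freed by background destroy processing
    zpool get freeing tank

    # Watch it drain over time (Ctrl-C to stop)
    watch -n 60 zpool get freeing tank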