Performance Considerations
More Memory, Better Performance
The easiest way to improve performance with SoftNAS is to add more RAM as main memory. By default, SoftNAS allocates 50% of RAM for caching. RAM caching operates at near bus speeds and is the fastest cache available. Memory is relatively inexpensive, so provide SoftNAS with as much RAM as you can for use as first-level cache to get the best performance results.
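As a quick illustration of the 50% default, the first-level cache available to SoftNAS scales directly with instance RAM. This is a minimal sketch; the function name and the tunable fraction are illustrative assumptions, not part of the product's API:

```python
def default_arc_size_gb(ram_gb, arc_fraction=0.5):
    """Estimate the default first-level (ARC) cache size.

    Assumes the 50%-of-RAM default described above; the actual
    fraction is tunable in a real deployment.
    """
    return ram_gb * arc_fraction

# e.g., a 64 GB instance leaves about 32 GB for first-level cache
print(default_arc_size_gb(64))
```

Doubling RAM therefore roughly doubles the first-level cache under the default policy, which is why adding memory is the simplest performance lever.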
Cache Devices
Solid state disk (SSD) and PCIE flash cache cards offer high-speed read caching and transaction logging for synchronous writes. However, not all SSDs are created equal, and some are better for these tasks than others. In particular, pay close attention to the specifications regarding 4K IOPS.
For read caching (L2ARC), both read and write IOPS matter, as do the device's sequential throughput specifications. If you will be running database, VMware VMDK or other workloads that produce large amounts of random, small (e.g., 4KB) reads and writes, ensure the SSD and flash cache devices provide high IOPS for 4K reads/writes.
For the write log (ZIL), extremely fast write IOPS matters most: the ZIL is only read after a power failure or other outage event, to replay synchronous write transactions that may not have been posted prior to the outage. ZFS always uses a ZIL (unless you set "sync=disabled"). By default, the ZIL resides on the devices which comprise the storage pool. An "SLOG" device (called a "Write Log" in SoftNAS) offloads the ZIL from the main pool to a separate log device, which improves performance when the right log device is chosen and configured properly.
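The benefit of a separate log device can be sketched with a simple latency calculation. Because each synchronous write must be acknowledged by the ZIL device before returning to the caller, single-threaded synchronous write IOPS is bounded by that device's latency. The latency figures below are illustrative assumptions, not measurements:

```python
def sync_write_iops(log_latency_ms):
    """Effective single-threaded synchronous write IOPS is bounded by
    the latency of the device that must acknowledge each ZIL commit."""
    return 1000.0 / log_latency_ms

# Illustrative latencies (assumptions, not measurements):
hdd_pool_ms = 5.0    # ZIL left on the pool's spinning disks
ssd_slog_ms = 0.125  # ZIL offloaded to a dedicated SSD SLOG

print(sync_write_iops(hdd_pool_ms))  # 200.0
print(sync_write_iops(ssd_slog_ms))  # 8000.0
```

Under these assumed numbers, moving the ZIL from spinning media to a low-latency SSD improves the synchronous write ceiling by roughly 40x, which is why SLOG device selection matters so much.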
2nd Level Read Cache
To further improve read and query performance, configure a Read Cache device for use with SoftNAS. SoftNAS leverages the ZFS "L2ARC" as its second level cache.
AWS EC2 Read Cache
On Amazon EC2, choose an instance type which includes local solid state (SSD) disks. The storage server will make use of as much read cache as you provide it. Read cache devices can be added and removed at any time with no risk of data loss to an existing storage pool. There are two choices for SSD read cache on EC2:
1) Local SSD - this is the fastest read cache available, as the local SSDs are directly attached to each EC2 instance and provide up to 120,000 IOPS
2) EBS Provisioned IOPS - these SSD-backed EBS volumes provide a specified level of guaranteed IOPS
VMware / Hyper-V Read Cache
On VMware or Hyper-V, add one or more SSD devices to your storage server. A properly-designed read cache is essential to get the IOPS and throughput required for database, VDI, vMotion and other workloads comprised primarily of small I/O operations (e.g., lots of small files, VMDKs, database transactions, etc.).
There are several ways to get the most performance from these cache devices (in order of best performance):
1) Pass-through Controller - in this configuration, the disk controller is passed through to the SoftNAS VM. This enables the SoftNAS Linux operating system to interact directly with the disk controller and provides the best possible performance, but it requires CPUs and motherboards which support Intel VT-d, along with disk controllers supported by the CentOS operating system. Note that for servers with the disk controller built into the motherboard, it is now common to install VMware or Windows Hyper-V and boot from USB, which frees up the disk controller for pass-through use. Refer to the Performance Tuning for VMware section for details.
2) PCIE Flash Cache Cards - flash memory plug-in cards with extremely fast NAND memory are available in PCIE form, making very fast cache available through the PCIE bus. Be sure to choose a PCIE flash memory card that is supported by your virtualization vendor (e.g., VMware, Hyper-V, etc.).
3) Raw Device Mapping - some SSD devices can be mapped directly to the SoftNAS VM using "Raw Device Mapping (RDM)". Raw device access allows SCSI commands to flow directly between the SoftNAS CentOS operating system and the SSD device (bypassing the VM host's file system, such as VMFS) for peak cache performance and IOPS, and reduces context-switching between the SoftNAS VM running CentOS and the virtualization host. Refer to Raw Device Mapping in VMware for more details (note that RDM is not officially supported, so do your own testing and validation of RDM devices and configurations for stability before attempting to use it in production). Disk controller pass-through is preferred to RDM on systems with processors and configurations that support it.
When designing your cache, it is ideal (but not absolutely required) to use the cache devices directly, rather than through VMFS (VMware) or NTFS (Windows) filesystems, where possible. Connect the cache device directly to the SoftNAS VM for best results. The reason is that both VMFS and NTFS are designed around 1 MB blocks, while the data that benefits most from caching is small I/O (e.g., 4K and 8K blocks). Direct I/O with the cache devices provides the most efficient cache operation.
Consult your virtualization vendor's website for more details on supported pass-through disk controllers, PCIE flash cache cards, raw device mapping and other available caching technologies.
Write Log
The "write log" on SoftNAS leverages the ZFS Intent Log (ZIL). The ZIL is a "transaction log" used to record synchronous writes (not asynchronous writes). When SoftNAS receives synchronous write requests, before returning to the caller, ZFS first records the write in memory and then completes the write to the ZIL. By default, the ZIL is located on the same persistent storage associated with the storage pool (e.g., spinning disk media). Once the write is recorded in the ZIL, the synchronous write is completed and the NFS, CIFS or iSCSI request returns to the caller.
To increase performance of synchronous writes, add a separate write log (sometimes referred to as a "SLOG") device, as discussed in the Read Cache section above. A separate write log device enables ZFS to quickly store synchronous write data and return to the caller. Note that this write log is only actually read in the event of a power failure or VM/instance crash, to replay the transactions that were not committed prior to the outage event. Writes remain in RAM cache to satisfy subsequent read requests, and are staged to permanent storage during normal transaction processing (every 5 seconds by default). So the write log provides a safe, fast way to ensure there will be no data loss for synchronous write operations, and enables those writes to complete as rapidly as possible.
One of the methods described above for Read Cache (pass-through controller, PCIE flash cache cards and raw device mapping) should be used for the write log device.
EC2 Users - do not use local SSD or ephemeral disks attached directly to the EC2 instance for the write log, as these instance local devices are not guaranteed to be available again after reboot. Instead, use EBS volumes with Provisioned IOPS for the Write Log (it's okay to use local SSD devices for Read Cache).
10 Gigabit Network Configurations on VMware
By default, the SoftNAS VM (on VMware) ships with the E1000 virtual NIC adapter, and VMware defaults to MTU 1500.
For best performance results above 1 gigabit, follow the steps outlined below.
1. Replace the E1000 virtual NIC adapter with a vmxnet3 on the SoftNAS VM.
2. Use MTU 9000 instead of MTU 1500 for vSwitch, vmKernel and physical switch configurations. Be sure to configure the network interface in SoftNAS for MTU 9000 also.
Refer to the MTU 9000 section for more information.
iSCSI Multi-pathing
To increase performance throughput and resiliency, use of iSCSI multipathing is recommended by VMware and other vendors.
Since SoftNAS operates in a hypervisor environment, it is possible to configure multi-path operation as follows:
1. On the VMware (Hyper-V) host where the SoftNAS VM runs, install and use multiple physical NIC adapters
2. Assign a dedicated vSwitch for each incoming iSCSI target path (one per physical NIC)
3. Assign the SoftNAS VM a dedicated virtual NIC adapter for each incoming iSCSI target path (per vSwitch/physical NIC)
4. Assign a unique IP address to each corresponding Linux network interface (for each virtual NIC attached to the SoftNAS VM)
5. Restart the SoftNAS iSCSI service and verify connectivity from the iSCSI initiator client(s) to each iSCSI target path.
A dedicated VLAN for storage traffic is recommended.
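The aggregate bandwidth gain from the multipathing steps above can be sketched as a simple ceiling calculation. The per-path figure and the linear scaling are illustrative assumptions; real round-robin multipathing delivers somewhat less:

```python
def multipath_ceiling_mb_s(n_paths, per_path_mb_s=120):
    """Upper bound on aggregate iSCSI throughput with round-robin
    multipathing over identical paths (e.g., 1 GbE NICs at roughly
    120 MB/sec each). Real-world results fall short of linear scaling."""
    return n_paths * per_path_mb_s

# Two dedicated 1 GbE paths: up to roughly 240 MB/sec aggregate
print(multipath_ceiling_mb_s(2))
```

Multipathing also improves resiliency: if one physical NIC or path fails, the initiator continues over the surviving paths at reduced aggregate bandwidth.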
Other Performance Considerations
As with any storage system, NAS performance is a function of many different factors working in combination:
- Disk drive speed and the chosen RAID configuration
- Cache memory (first level read cache or ARC)
- 2nd level cache (e.g., L2ARC) speed
- Disk controller and protocol
- Network bandwidth available; e.g., 1 GbE vs. 10 GbE vs. Infiniband
- Network QoS (whether the network is dedicated, shared, local vs. remote, EC2 provisioned IOPS, etc.)
- Network latency (between workload VMs and the SoftNAS VM)
- MTU settings in VM host software and switches
- Thin-provisioning vs. thick
- Available CPU (especially when compression is enabled)
- Network access protocol (NFS, CIFS/SMB, iSCSI, direct-attached Fibre Channel)
- Use of VLANs to separate storage traffic from other network traffic
The tradeoffs between cost and performance can be significant, so understanding your actual, initial performance needs, plus contingency plans to address growth in those needs over time, is important when designing your NAS solution.
Virtual Devices and IOPS - As SoftNAS is built atop ZFS, IOPS (I/O operations per second) are mostly a function of the number of virtual devices (vdevs) in a zpool, not of the raw number of disks in the zpool. This is probably the single most important thing to understand about ZFS performance, and it is commonly misunderstood. A vdev ("virtual device") is a single device or partition that acts as a source of storage on which a pool can be created. For example, in VMware, each vdev can be a VMDK or raw disk device assigned to the SoftNAS VM.
A multi-device or multi-partition vdev can take one of the following shapes:
- Stripe (technically, each chunk of a stripe is its own vdev)
- Mirror
- RaidZ
- A dynamic stripe of multiple mirror and/or RaidZ child vdevs
ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS-bound by the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively that of a single disk, not 100. There are a couple of caveats here (such as the difference between write and read IOPS), but as a rule of thumb, treat a zpool's raw IOPS potential as equivalent to the single slowest disk in each vdev in the zpool, and you won't end up surprised or disappointed.
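The rule of thumb above can be sketched as a small calculation. The function and the 150-IOPS disk figure are illustrative assumptions, but the structure mirrors how ZFS stripes across vdevs:

```python
def pool_raw_iops(vdevs):
    """Rule-of-thumb raw IOPS for a zpool: each vdev contributes
    roughly the IOPS of its slowest member disk, and ZFS stripes
    across vdevs, so the per-vdev contributions add up."""
    return sum(min(disk_iops) for disk_iops in vdevs)

# One 100-disk vdev of 150-IOPS disks: roughly one disk's worth of IOPS
one_wide_vdev = [[150] * 100]
# Ten 10-disk vdevs of the same disks: roughly ten disks' worth
ten_vdevs = [[150] * 10 for _ in range(10)]

print(pool_raw_iops(one_wide_vdev))  # 150
print(pool_raw_iops(ten_vdevs))      # 1500
```

The same 100 disks deliver roughly ten times the raw IOPS when arranged as ten vdevs instead of one, which is why vdev layout, not raw disk count, dominates ZFS IOPS planning.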
Of course, if you are using hardware RAID which presents a unified datastore to VMware (or Hyper-V), then the actual striping of writes occurs in your RAID controller card. Just be aware of where striping occurs and the implications on performance (especially for write throughput).
Block size, Windows and VMware VMDK workloads - VMware uses 4K block reads and writes. If you have a high-performance VMware use case, be sure to deploy an adequate amount (e.g., 64 GB or more) of write log (ZFS "ZIL") and RAM, plus read cache (ZFS "L2ARC"), to absorb the high level of 4K block I/O for best results. If you have workloads with predominately small (less than 128K) reads and writes, making use of RAM, write log and read cache is critical to achieving maximum throughput, as ZFS block I/O occurs in 128K chunks. Windows also defaults to 4K blocks.
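The mismatch between small guest I/O and 128K ZFS blocks can be expressed as a worst-case amplification factor. This is an illustrative sketch of the arithmetic, assuming each small uncached read forces a full-record read:

```python
def read_amplification(record_size_kb, io_size_kb):
    """Worst-case read amplification when small random I/O misses the
    cache and forces a full-record read (e.g., 4K guest I/O against
    128K ZFS records)."""
    return record_size_kb / io_size_kb

# 4K VMware/Windows I/O against 128K ZFS records: up to 32x amplification
print(read_amplification(128, 4))
```

This 32x worst case is exactly why generous RAM, read cache and write log absorb so much of the small-block load: every small I/O served from cache avoids a full 128K trip to the pool.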
Fortunately, the cost of high-speed media continues to drop, with SSD drives eclipsing high-speed spindles (e.g., 15K SAS) in both performance and cost. And memory has become very affordable, so deploying 64 GB to as much as 1 TB of bus-speed memory is a great way to accelerate your NAS' performance out of the starting gates. Use of SSD for read cache and write logs can also greatly speed performance, even when front-ending slower SATA mass storage, for many use cases.
7,200 RPM drives are designed for single-user, sequential access (not multi-user, virtualized workloads) - use of 10K or 15K SAS drives in a RAID 10 or RAID 6 configuration is recommended as a starting point.
But a fast NAS response to requests isn't the only governing factor in how well your workloads perform. Network design, available bandwidth and latency are also important factors. For example, for high-performance NAS applications, use of a dedicated VLAN for storage is a must where possible. Configuring all components in the storage path for MTU 9000 will greatly increase throughput by reducing the effects of round-trip network latency and reducing the interrupt load on the NAS server itself. Interrupts are often overlooked as a source of overhead, because they aren't readily measured, but their effects can be significant, both on the NAS server and workload servers. Make sure you configure any NAS from which you need the highest level of performance for MTU 9000, along with the switch ports used between the NAS host and workload servers.
A single 1 GbE network segment will produce at most 120 MB/sec of throughput under the most ideal conditions possible. 10 GbE has been observed to deliver up to 1,000 MB/sec of throughput.
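These figures follow from simple link arithmetic. The efficiency factor below is an illustrative assumption covering Ethernet, IP and TCP framing overhead, not a measured value; observed throughput (especially on 10 GbE) typically lands below this ceiling:

```python
def ideal_throughput_mb_s(link_gbps, efficiency=0.96):
    """Rough theoretical ceiling for TCP storage traffic on an
    Ethernet link: convert gigabits to megabytes, then apply an
    assumed protocol-overhead efficiency factor."""
    return link_gbps * 1000 / 8 * efficiency

print(ideal_throughput_mb_s(1))   # about 120 MB/sec
print(ideal_throughput_mb_s(10))  # about 1,200 MB/sec ceiling
```

The 1 GbE result matches the ~120 MB/sec figure above; the 10 GbE ceiling of roughly 1,200 MB/sec explains why ~1,000 MB/sec is a realistic observed maximum once real-world overheads are included.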
The next consideration is protocol - Should you use NFS, CIFS or iSCSI? iSCSI often provides the best throughput, and increased resiliency through multi-pathing. Just be aware of the added complexities associated with iSCSI.
For VM-based workloads - it's hard to go wrong with NFS or iSCSI. For user data (e.g., file shares), CIFS is more common because of the need to integrate natively with Windows, domain controllers and Active Directory when using a NAS as a file server.
Thick-provisioning VMware datastores provides increased write performance and should be preferred over thin-provisioning of VMDKs when optimal performance is required.
Whatever design you come up with, it's important to verify your implementation by running performance benchmarks to validate you are actually seeing the throughput expected (before you go into production).
One approach that works well for a broad range of applications is to use a combination of SAS and SATA drives - using SSD for read cache/write log (always configure write logs as mirrored pairs in case a drive fails). SATA drives provide very high densities in a relatively small footprint, which is perfect for user mass storage, Windows profiles, Office files, MS Exchange, etc. SQL Server typically demands SAS and/or SSD for best results, due to the high transaction rates involved. Exchange can be relatively heavy on I/O when it's starting up, but since it reads most everything into memory, high-speed caching does little to help run-time performance after initial startup.
Virtual desktops benefit greatly from all the cache memory, level 2 caching and high-speed storage you can afford, because many performance lags quickly become visible as users launch applications, open and save files, etc. Caching also helps alleviate "login storms" and "boot storms" that occur when a large number of simultaneous users attempt to log in first thing in the morning. For these situations, a combination of local caching (on each VDI server) and appropriate caching for user profiles and applications can yield excellent results.
Deduplication Is Not Free - A common misunderstanding is that ZFS deduplication is free, enabling space savings on your ZFS filesystems/zvols/zpools at no cost. Nothing could be farther from the truth. ZFS deduplication is performed on-the-fly as data is read and written, which can lead to a significant and sometimes unexpectedly high RAM requirement.
Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDTs to grow to sizes larger than available RAM on zpools that aren't even that large (a couple of TBs). If the hits against the DDT aren't being serviced primarily from RAM (or fast SSD configured as L2ARC), performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to the data already committed on disk, it is recommended that you do not enable deduplication without a full understanding of its RAM and caching requirements. You will be hard-pressed to get rid of it later, after you have many terabytes of deduplicated data already written to disk and discover you need more RAM and/or cache; i.e., plan your cache and RAM needs around how much total deduplicated data you expect to have.
A general rule of thumb is to provide at least 2 GB of DDT per TB of deduplicated data (actual results will vary based on how much duplication of data you actually have).
Please note that the DDT tables require RAM beyond whatever you need for caching of data, so be sure to take this into account (RAM is very affordable these days, so get more than you think you may need to be on the safe side).
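The sizing rule above reduces to a simple calculation. This sketch applies the at-least-2-GB-per-TB floor from the text; the function name is an illustrative assumption, and actual DDT size varies with how much duplication your data actually contains:

```python
def ddt_ram_floor_gb(dedup_tb, gb_per_tb=2):
    """Rule-of-thumb minimum RAM for the DDT: at least 2 GB per TB
    of deduplicated data. This is a floor, not a budget, and it is
    in addition to RAM used for data caching (ARC)."""
    return dedup_tb * gb_per_tb

# A pool expected to hold 10 TB of deduplicated data needs roughly
# 20 GB of RAM for the DDT alone, beyond normal caching needs
print(ddt_ram_floor_gb(10))
```

Running this estimate against your planned deduplicated capacity before enabling dedup is far cheaper than discovering the RAM shortfall after terabytes of deduplicated data are already on disk.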
Extremely Large Destroy Operations - When you destroy large filesystems, snapshots and cloned filesystems (e.g., in excess of a terabyte), the data is not immediately deleted - it is scheduled for background deletion processing. The deletion process touches many metadata blocks, and in a heavily deduplicated pool, must also look up and update the DDT to ensure the block reference counts are properly maintained. This results in a significant amount of additional I/O, which can impact the total IOPS available for production workloads.
For best results, schedule large destroy operations for after hours or weekends so those deletion processing IOPS will not impact the IOPS available for normal business day operations.