Fsync Performance on Storage Devices: The Role of Doublewrite and Flush-Based Synchronization

aleidawotring0926f
Aug 15, 2023
5 min read

Because the fsync call takes time, it greatly affects the performance of MySQL. Because of this, you probably noticed there are many status variables that relate to fsyncs. To overcome the inherent limitations of the storage devices, group commit allows multiple simultaneous transactions to fsync the log file once for all the transactions waiting for the fsync. There is no need for a transaction to call fsync for a write operation that another transaction already forced to disk. A series of write transactions sent over a single database connection cannot benefit from group commit.

Fsync Performance on Storage Devices

DOWNLOAD

With the above number, the possible transaction rates in fully ACID mode is pretty depressing. But those drives were rotating ones, what about SSD drives? SSD are memory devices and are much faster for random IO operations. There are extremely fast for reads and good for writes. But as you will see below, not that great for fsyncs.

The fsync system is not the only system call that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two. In the Python script, if I replace os.fsync with os.fdatasync, here are the results for a subset of devices:

I tested your fsync.py using our SAN HPE 3PAR StoreServ 8400 storage.It is relatively high level flash-based storage device.10 000 iterations took 19.303s or 1.903 ms per fsync ( 518 fsyncs / seconds).

Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by kernel. Through an fsync(), these modified blocks are applied to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on the disk. When there is an fsync to flush the pages to disk, PostgreSQL cannot guarantee a copy of a modified/dirty page. The reason is that writes to storage from the page cache are completely managed by the kernel, and not by PostgreSQL.

In a fully durable configuration MySQL tends to be impacted even more by poor fsync() performance. It may need to perform as many as three fsync operations per transaction commit. Group commit reduces the impact on throughput but transaction latency will still be severely impacted

With the above number, the possible transaction rates in fully ACID mode is pretty depressing. But those drives were rotating ones, what about SSD drives? SSD are memory devices and are much faster for random IO operations. There are extremely fast for reads, and good for writes. But as you will see below, not that great for fsyncs.

A few years ago, Intel introduced a new type of storage devices based on the 3D_XPoint technology and sold under the Optane brand. Those devices are outperforming regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices are not great at doing.

The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s and is much better than a regular NAND based NVMe card like the Intel PC-3700, able to deliver a fsync rate of 7300/s. Even enabling the full ext4 journal, the rate is still excellent although, as expected, cut by about half.

If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two easily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB Sata disk and the cache layer set to writeback.

In my previous post Testing Samsung storage in tpcc-mysql benchmark of Percona Server I compared different Samsung devices. Most solid state drives (SSDs) use 4KiB as an internal page size, and the InnoDB default page size is 16KiB. I wondered how using a different innodb_page_size might affect the overall performance.

While working on the service architecture for one of our projects, I considered several SATA SSD options as the possible main storage for the data. The system will be quite write intensive, so the main interest is the write performance on capacities close to full-size storage.

Persistent disks are networked storage and generally have higher latencycompared to physical disks or local SSDs.To reach the maximum performance limits of your persistent disks, you mustissue enough I/O requests in parallel. To check if you're using a high enoughqueue depth to reach your required performance levels, seeI/O queue depth.

NAND flash memory has been used widely in various mobile devices like smartphones, tablets and MP3 players. Furthermore, server systems started utilizing flash devices as their primary storage. Despite its broad use, flash memory has several limitations, like erase-before-write requirement, the need to write on erased blocks sequentially and limited write cycles per erase block.

These file systems directly access NAND flash memories while addressing all the chip-level issues such as wear-levelling and bad-block management. Unlike these systems, F2FS targets flash storage devices that come with a dedicated controller and firmware (FTL) to handle low-level tasks. Such flash storage devices are more commonplace.

F2FS was designed from scratch to optimize the performance and lifetime of flash devices with a generic block interface. It builds on the concept of the Log-Structured Filesystem (LFS), but also introduces a number of new design considerations:

Applications like database (e.g., SQLite) frequently write small data to a file and conduct fsync to guarantee durability. A naive approach to supporting fsync would be to trigger checkpointing and recover data with the roll-back model. However, this approach leads to poor performance, as checkpointing involves writing all node and dentry blocks unrelated to the database file. F2FS implements an efficient roll-forward recovery mechanism to enhance fsync performance. The key idea is to write data blocks and their direct node blocks only, excluding other node or F2FS metadata blocks. In order to find the data blocks selectively after rolling back to the stable checkpoint, F2FS retains a special flag inside direct node blocks.

Experimental results showed that adaptive logging is critical to sustain performance at high storage utilization levels. The adaptive logging policy is also shown to effectively limit the performance degradation of F2FS due to fragmentation.

Some storage optimized instance types provide the ability to control processor C-states and P-states on Linux. C-states control the sleep levels that a core can enter when it is inactive, while P-states control the desired performance (in CPU frequency) from a core. For more information, see Processor state control for your EC2 instance.

SSD controllers can use several strategies to reduce the impact of write amplification. One such strategy is to reserve space in the SSD instance storage so that the controller can more efficiently manage the space available for write operations. This is called over-provisioning. The SSD-based instance store volumes provided to an instance don't have any space reserved for over-provisioning. To reduce write amplification, we recommend that you leave 10% of the volume unpartitioned so that the SSD controller can use it for over-provisioning. This decreases the storage that you can use, but increases performance even if the disk is close to full capacity. 2ff7e9595c

Fsync Performance on Storage Devices: The Role of Doublewrite and Flush-Based Synchronization

Fsync Performance on Storage Devices

Recent Posts

Comments

About Me

#LeapofFaith

Posts Archive

Keep Your Friends
Close & My Posts Closer.

Fsync Performance on Storage Devices

Comments

About Me

#LeapofFaith

Posts Archive

Keep Your Friends Close & My Posts Closer.

Keep Your Friends
Close & My Posts Closer.