The Linux fsync() System Call in Detail
Description:
Flush all modified in-memory data of the file referred to by fd to the storage device.
Usage:
#include <unistd.h>
int fsync(int fd);
Parameter:
fd: the file descriptor.
Return value:
Returns 0 on success. On failure it returns -1 and errno is set to one of the following values:
EBADF: fd is not a valid file descriptor
EIO: an error occurred during reading or writing
EROFS, EINVAL: the file resides on a file system that does not support synchronization
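To make the usage concrete, here is a minimal sketch of the typical write-then-fsync pattern; the file name and data below are made up for illustration, and error handling is kept short.

/* Minimal sketch: write a record to a file and force it to storage
 * with fsync().  The path and payload are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char data[] = "important record\n";

    int fd = open("record.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    if (write(fd, data, strlen(data)) != (ssize_t)strlen(data)) {
        perror("write");
        close(fd);
        return EXIT_FAILURE;
    }

    /* write() only puts the data into kernel buffers; fsync() blocks
     * until the data has reached the storage device. */
    if (fsync(fd) < 0) {
        perror("fsync");        /* e.g. EBADF, EIO, EROFS or EINVAL */
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}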
Forcing the system cache to be written out: the sync and fsync functions; the relationship and difference between fflush and fsync
Traditional UNIX implementations have a buffer cache in the kernel, and most disk I/O goes through it. When data is written to a file, the kernel normally copies it into a buffer first; if the buffer is not yet full, it is not queued for output. Instead, the kernel waits until the buffer fills up, or until it needs to reuse the buffer to hold other disk blocks, and only then places the buffer on the output queue; the actual I/O is performed when the buffer reaches the head of the queue. This style of output is called delayed write (Bach [1986] discusses delayed writes in detail in Chapter 3). Delayed writes reduce the number of disk operations,
but they also delay updates to the file contents, so data that is meant to be written to a file may not actually reach the disk for some time. If the system crashes, this delay can cause the updated contents to be lost. To keep the file system on disk consistent with the contents of the buffer cache, UNIX provides two system calls: sync and fsync.
#include <unistd.h>
void sync(void);
int fsync(int filedes);
Returns: 0 if OK, -1 on error
sync simply queues all the modified block buffers for writing and then returns; it does not wait for the actual I/O to complete. A system daemon (usually called update) generally calls sync every 30 seconds, which guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.
The fsync function refers to a single file, specified by the file descriptor filedes, and waits for the I/O to complete before returning. fsync can be used by applications such as databases, which need to be sure that modified blocks have been written to the disk immediately. Compare fsync with the O_SYNC flag (Section 3.13): fsync flushes the file's contents when it is called, whereas with O_SYNC the file's contents are flushed on every write to the file.
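To make the comparison concrete, the sketch below contrasts the two approaches; the file names and data are invented for illustration.

/* Sketch: O_SYNC vs. an explicit fsync().  File names are hypothetical. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

void demo(void)
{
    const char rec[] = "one record\n";

    /* With O_SYNC, every write() blocks until the data is on the device. */
    int fd_sync = open("journal.sync", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd_sync >= 0) {
        (void)write(fd_sync, rec, strlen(rec));   /* synchronous by itself */
        close(fd_sync);
    }

    /* Without O_SYNC, writes are only buffered; one fsync() flushes them all. */
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd >= 0) {
        (void)write(fd, rec, strlen(rec));
        (void)write(fd, rec, strlen(rec));
        (void)fsync(fd);                          /* flush both records at once */
        close(fd);
    }
}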
The relationship and difference between fflush and fsync
[Reposted] http://blog.chinaunix.net/u2/73874/showart_1421917.html
1. Provider: fflush is a function provided by libc.a; fsync is a system call provided by the kernel.
2. Prototype: fflush takes a FILE * argument, fflush(FILE *); fsync takes an int file descriptor, fsync(int fd).
3. Function: fflush flushes the C library buffer by calling write (so the data only reaches the kernel's buffers); fsync flushes the kernel buffers to the disk.
C library buffer --fflush--> kernel buffer --fsync--> disk
Below is a reposted English article:
Write-back support
UBIFS supports write-back, which means that file changes do not go to the flash media straight away; they are cached and written to the flash later, when it becomes necessary. This greatly reduces the amount of I/O, which results in better performance. Write-back caching is a standard technique used by most file systems, such as ext3 and XFS.
In contrast, JFFS2 does not have write-back support, and all JFFS2 file system changes go to the flash synchronously. Well, this is not completely true: JFFS2 does have a small buffer of one NAND page size (if the underlying flash is NAND). This buffer holds the last written data and is flushed once it is full. However, because the amount of cached data is very small, JFFS2 is very close to being a synchronous file system.
Write-back support requires application programmers to take extra care to synchronize important files in time. Otherwise the files may become corrupted or disappear in case of power cuts, which happen very often in many embedded devices. Let's take a glimpse at the Linux manual pages:
$ man 2 write
....
NOTES
A successful return from write() does not make any guarantee that data
has been committed to disk. In fact, on some buggy implementations, it
does not even guarantee that space has successfully been reserved for
the data. The only way to be sure is to call fsync(2) after you are
done writing all your data.
...
This is true for UBIFS (except for the "some buggy implementations" part, because UBIFS does reserve space for cached dirty data). This is also true for JFFS2, as well as for any other Linux file system.
However, some (perhaps not very good) user-space programmers do not take write-back into account. They do not read the manual pages carefully. When such applications are used in embedded systems running JFFS2, they appear to work fine, because JFFS2 is almost synchronous. Of course, the applications are still buggy, but they work well enough with JFFS2. The bugs show up when UBIFS is used instead. Please be careful and check/test your applications with respect to power-cut tolerance if you switch from JFFS2 to UBIFS. The following is a list of useful hints and advice.
If you want to switch to synchronous mode, use the -o sync option when mounting UBIFS; however, file system performance will drop, so be careful. Also remember that UBIFS mounted in synchronous mode provides fewer guarantees than JFFS2; refer to this section for details.
Always keep in mind the above statement from the manual pages and call fsync() for all important files you change. Of course, there is no need to synchronize "throw-away" temporary files; just think about how important the file data is and decide. Do not use fsync() unnecessarily, because it hurts performance.
If you want to be more precise, you may use fdatasync(), in which case only data changes will be flushed, but not inode meta-data changes (e.g., "mtime" or permissions). This can be cheaper than fsync() if the synchronization is done often, e.g., in a loop; otherwise just stick with fsync().
In a shell, the sync command may be used, but it synchronizes the whole file system, which might not be optimal; there is also a similar libc sync() function.
You may use the O_SYNC flag of the open() call; this makes sure that all data (but not meta-data) changes reach the media before the write() operation returns. In general, however, it is better to use fsync(), because O_SYNC makes every write synchronous, while fsync() allows many writes to be accumulated and synchronized at once.
It is possible to make certain inodes synchronous by default by setting the "sync" inode flag; in a shell, the chattr +S command may be used; in C programs, use the FS_IOC_SETFLAGS ioctl command (see the sketch after this list). Note that the mkfs.ubifs tool checks for the "sync" flag in the original FS tree, so files that are synchronous in the original FS tree will also be synchronous in the resulting UBIFS image.
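As referenced in the last item above, here is a minimal sketch of setting the "sync" inode flag from a C program (roughly what chattr +S does); the helper name and the path argument are hypothetical.

/* Sketch: set the "sync" inode flag from C (the equivalent of `chattr +S`).
 * The path argument is hypothetical; error handling is kept minimal. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_SYNC_FL */

int make_inode_synchronous(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) {   /* read current flags */
        perror("FS_IOC_GETFLAGS");
        close(fd);
        return -1;
    }

    attr |= FS_SYNC_FL;                            /* add the "sync" flag */
    if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) {
        perror("FS_IOC_SETFLAGS");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}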
Let us stress that the above items are true for any Linux file system, including JFFS2.
fsync() may be called for directories; it synchronizes the directory inode meta-data. The "sync" flag may also be set on directories to make the directory inode synchronous. The flag is inherited, which means that all new children of such a directory will also have the flag: new files and sub-directories of the directory will also be synchronous, and their children, and so forth. This feature is very useful if one needs to create a whole sub-tree of synchronous files and directories, or to make all new children of some directory synchronous by default (e.g., /etc).
The fdatasync() call for directories is a no-op in UBIFS, and all UBIFS operations which change directory entries are synchronous. However, you should not assume this for portability (e.g., this is not true for ext2). Similarly, the "dirsync" inode flag has no effect in UBIFS.
The functions mentioned above work on file descriptors, not on streams (FILE *). To synchronize a stream, you should first get its file descriptor using the fileno() libc function, then flush the stream using fflush(), and then synchronize the file using fsync() or fdatasync(). You may use other synchronization methods, but remember to flush the stream before synchronizing the file. The fflush() function flushes the libc-level buffers, while sync(), fsync(), etc. flush the kernel-level buffers.
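A minimal sketch of that procedure might look as follows; the helper name is made up.

/* Sketch: synchronize a stdio stream.  fflush() moves the libc buffer
 * into the kernel; fsync() then pushes the kernel buffers to the media. */
#include <stdio.h>
#include <unistd.h>

int sync_stream(FILE *fp)
{
    if (fflush(fp) != 0)            /* libc buffer -> kernel page cache */
        return -1;
    if (fsync(fileno(fp)) < 0)      /* kernel page cache -> storage */
        return -1;
    return 0;
}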
Please refer to this FAQ entry for information about how to atomically update the contents of a file. Also, Theodore Tso's article is a good read.
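As a rough illustration of the atomic-update idea mentioned above (the referenced FAQ entry has the authoritative recipe), here is a hedged sketch of the commonly used write-temporary-file-then-rename pattern; the function name and path arguments are hypothetical.

/* Hedged sketch of the common "write temp file, fsync, rename" pattern.
 * The target and temporary paths are assumed to live in the same
 * (current) directory; all names here are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int atomic_replace(const char *path, const char *tmp_path,
                   const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    if (rename(tmp_path, path) < 0)    /* atomically replace the old file */
        return -1;

    /* Optionally fsync the parent directory so the rename itself survives
     * a power cut (here assumed to be the current directory). */
    int dirfd = open(".", O_RDONLY);
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }
    return 0;
}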
Write-back knobs in Linux
Linux has several knobs in "/proc/sys/vm" which you may use to tune write-back. The knobs are global, so they affect all file systems. Please refer to the "Documentation/sysctl/vm.txt" file for more information; the file may be found in the Linux kernel source tree. Below, the most interesting knobs are described in the UBIFS context and in a simplified form.
dirty_writeback_centisecs - how often the Linux periodic write-back thread wakes up and writes out dirty data. This is a mechanism which makes sure all dirty data hits the media at some point.
dirty_expire_centisecs - the dirty-data expiry period. This is the maximum time data may stay dirty; after this period it will be written back by the Linux periodic write-back thread. IOW, the periodic write-back thread wakes up every "dirty_writeback_centisecs" centi-seconds and synchronizes data which was dirtied "dirty_expire_centisecs" centi-seconds ago.
dirty_background_ratio - maximum amount of dirty data in percent of total memory. When the amount of dirty data becomes larger, the periodic write-back thread starts synchronizing it until it becomes smaller. Even non-expired data will be synchronized. This may be used to set a "soft" limit for the amount of dirty data in the system.
dirty_ratio - maximum amount of dirty data at which writers will first synchronize the existing dirty data before adding more. IOW, this is a "hard" limit of the amount of dirty data in the system.
Note that UBIFS additionally has small write-buffers which are synchronized every 3-5 seconds. This means that most of the dirty data is delayed by dirty_expire_centisecs centi-seconds, but the last few KiB are additionally delayed by 3-5 seconds.
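As a small illustration, the sketch below simply reads and prints the current values of the knobs listed above from /proc/sys/vm; tuning is normally done via sysctl or by writing to these files.

/* Sketch: print the current write-back knob values from /proc/sys/vm. */
#include <stdio.h>

static void print_knob(const char *name)
{
    char path[128];
    long value;

    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    FILE *fp = fopen(path, "r");
    if (fp != NULL && fscanf(fp, "%ld", &value) == 1)
        printf("%-28s = %ld\n", name, value);
    if (fp != NULL)
        fclose(fp);
}

int main(void)
{
    print_knob("dirty_writeback_centisecs");
    print_knob("dirty_expire_centisecs");
    print_knob("dirty_background_ratio");
    print_knob("dirty_ratio");
    return 0;
}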
UBIFS write-buffer
UBIFS is an asynchronous file system (read this section for more information). Like other Linux file systems, it utilizes the page cache. The page cache is a generic Linux memory-management mechanism; it may be very large and cache a lot of data. When you write to a file, the data is written to the page cache, marked as dirty, and the write returns (unless the file is synchronous). The data is written back later.
The write-buffer is an additional buffer implemented inside UBIFS; it sits between the page cache and the flash. This means that write-back actually writes to the write-buffer, not directly to the flash.
The write-buffer is designed to speed up UBIFS on NAND flashes. NAND flashes consist of NAND pages, which are usually 512 bytes, 2KiB or 4KiB in size. A NAND page is the minimal read/write unit of NAND flash (see this section).
The write-buffer size is equal to the NAND page size (so it is tiny compared to the page cache). Its purpose is to accumulate small writes and write out full NAND pages instead of partially filled ones. Indeed, imagine we have to write four 512-byte nodes at half-second intervals, and the NAND page size is 2KiB. Without the write-buffer we would have to write 4 NAND pages and waste 6KiB of flash space, while the write-buffer allows us to write only once and waste nothing. We write less, we create less dirty space so the UBIFS garbage collector has less work to do, and we save power.
Of course, this example shows an ideal situation, and even with the write-buffer we may waste space, for example in the case of synchronous I/O, or if the data arrives at long intervals. This is because the write-buffer has an associated timer which flushes it every 3-5 seconds, even if it is not full. We do this for data integrity reasons.
Of course, when UBIFS has to write a lot of data, it does not use the write-buffer; only the last part of the data, which is smaller than a NAND page, ends up in the write-buffer and waits for more data until it is flushed by the timer.
The write-buffer implementation is a little more complex, and we actually have several of them - one for each journal head. But this does not change the basic idea behind the write-buffer.
A few notes with regard to synchronization:
"sync()" also synchronizes all write-buffers;
"fsync(fd)" also synchronizes all write-buffers which contain pieces of "fd";
synchronous files, as well as files opened with "O_SYNC", bypass write-buffers, so the I/O is indeed synchronous for these files;
write-buffers are also bypassed if the file-system is mounted with the "-o sync" mount option.
Take into account that write-buffers delay the data synchronization timeout defined by "dirty_expire_centisecs" (see here) by 3-5 seconds. However, since write-buffers are small, only a small amount of data is delayed.
UBIFS in synchronous mode vs JFFS2
When UBIFS is mounted in synchronous mode (the -o sync mount option), all file system operations become synchronous. This means that all data is written to the flash before the file system operations return.
For example, if you write 10MiB of data to a file f.dat using the write() call, and UBIFS is in synchronous mode, then UBIFS guarantees that all 10MiB of data and the meta-data (file size and date changes) will reach the flash media before write() returns. And if a power cut happens after the write() call returns, the file will contain the written data.
The same is true when f.dat was opened with O_SYNC or has the "sync" flag set (see man 1 chattr).
It is well known that the JFFS2 file system is synchronous (apart from a small write-buffer). However, UBIFS in synchronous mode is not the same as JFFS2, and it provides somewhat weaker guarantees than JFFS2 does with respect to sudden power cuts.
In JFFS2 all the meta-data (like inode atime/mtime/ctime, inode size, UID/GID, etc.) is stored in the data node headers. Data nodes carry 4KiB of (compressed) data. This means that the meta-data information is duplicated in many places, but it also means that every time JFFS2 writes a data node to the flash media, it updates the inode size as well. So when JFFS2 mounts, it scans the flash media, finds the latest data node, and fetches the inode size from there.
In practice this means that JFFS2 will write these 10MiB of data sequentially, from the beginning to the end. And if there is a power cut, you will just lose some amount of data at the end of the inode. For example, if JFFS2 starts writing those 10MiB of data, writes 5MiB of them, and a power cut happens, you will end up with a 5MiB f.dat file; you lose only the last 5MiB.
Things are a little more complex in the case of UBIFS, where data are stored in data nodes and meta-data are stored in (separate) inode nodes. The meta-data are not duplicated in each data node, as in JFFS2. UBIFS never writes data nodes beyond the on-flash inode size. If it has to write a data node that lies beyond the on-flash inode size (the in-memory inode has the up-to-date size, but it is dirty and has not been flushed yet), then UBIFS first writes the inode to the media, and then it starts writing the data. And if a power cut interrupts this, you lose data nodes and you get holes (or old data nodes, if you are overwriting). Let's consider an example.
User creates an empty file f.dat. The file is synchronous, or UBIFS is mounted in synchronous mode. User calls the write() function with a 10MiB buffer.
The kernel first copies all 10MiB of the data to the page cache. Inode size is changed to 10MiB as well and the inode is marked as dirty. Nothing has been written to the flash media so far. If a power cut happens at this point, the user will end up with an empty f.dat file.
UBIFS sees that the I/O has to be synchronous and starts synchronizing the inode. First of all, it writes the inode node to the flash media. If a power cut happens at this moment, the user will end up with a 10MiB file which contains no data (a hole), and reading this file will return 10MiB of zeroes.
UBIFS starts writing the data. If a power cut happens at this point, the user will end up with a 10MiB file containing a hole at the end.
Note that if the I/O were not synchronous, UBIFS would skip the last step and just return; the actual write-back would then happen in the background. But power cuts during write-back could also lead to files with holes at the end.
Thus, synchronous I/O in UBIFS provides fewer guarantees than JFFS2 I/O: UBIFS can leave holes at the end of files. In an ideal world, applications should not assume anything about the contents of files which were not synchronized before a power cut happened. And "mainstream" file systems like ext3 do not provide JFFS2-like guarantees either.
However, UBIFS is sometimes used as a JFFS2 replacement, and people may want it to behave the same way as JFFS2 when it is mounted synchronously. This is doable, but needs some non-trivial development, so it has not been implemented so far; on the other hand, there has been no strong demand. You may implement this as an exercise, or you may try to convince the UBIFS authors to do it.
Synchronization exceptions for buggy applications
As this section describes, UBIFS is an asynchronous file-system, and applications should synchronize their files whenever it is required. The same applies to most Linux file-systems, e.g. XFS.
However, many applications ignore this and do not synchronize files properly. And there was a huge war between user-space and kernel developers related to the ext4 delayed allocation feature. Please see Theodore Tso's blog post; more information may be found in this LWN article.
In short, the flame war was about two cases. The first case was atomic re-name, where many user-space programs did not synchronize the copy before re-naming it. The second case was applications which truncate files, then change them. There was no final agreement, but the "we cannot ignore the real world" argument found understanding among the ext4 developers, and two ext4 changes were made which help with both problems.
Roughly speaking, the first change made ext4 synchronize files on close if they had previously been truncated. This was a hack from the file-system point of view, but it "fixed" applications which truncate files, write new contents, and close the files without synchronizing them.
The second change made ext4 synchronize the renamed file.
Well, this is not an entirely accurate description, because ext4 does not write the files synchronously; it actually initiates asynchronous write-out of the files, so the performance hit is not very high. For the truncation case this means that the file is synchronized soon after it is closed. For the re-name case it means that ext4 writes the data before it writes the re-name meta-data.
However, application writers should never rely on these things, because this behavior is not portable. Instead, they should properly synchronize their files. The ext4 fixes were made only because there were already many broken user-space applications in the wild.
We have plans to implement these features in UBIFS, but this has not been done yet. The problem is that UBI/MTD are fully synchronous and we cannot initiate asynchronous write-out, so we'd have to synchronously write files on close/rename, which is slow. So implementing these features would require implementing asynchronous I/O in UBI, which is a big job. But feel free to do this :-).