RU.LINUX
From: Sergey Lentsov, 2:4615/71.10
Date: 13 Aug 2001 17:10:24
To: All
Subject: URL: http://www.lwn.net/2001/0809/kernel.php3
See also: [14]last week's Kernel page.
Kernel development
The current kernel release is still 2.4.7. Linus released 2.4.8pre7 on
August 7, but, as of this writing, the changelog had not been updated
from [15]the pre6 version. It contains more VM fixes (see below) and a
number of other updates.
Alan Cox's latest is [16]2.4.7ac10. It contains a vast number of fixes
and updates, but the most interesting part may be the merging of the
ext3 journaling filesystem, which happened in [17]2.4.7ac4. While ext3
will likely not find its way into Linus's kernel for some time yet,
its presence in the "ac" series is a firm step in that direction.
On synchronous directory operations. If an application renames (or
makes a link to) a file, how can it know when that operation has found
its way to the physical device and will not disappear if the system
crashes? Most applications are not too concerned about such issues, as
long as their operations make it to persistent storage eventually. But
there are exceptions. In particular, a number of mail transfer agents
(MTAs), such as Postfix and qmail, depend heavily on link and rename
operations for reliable delivery of mail. They need to know when these
operations have completed.
Many Unix-like systems, it seems, implement directory operations like
link() in a synchronous manner. When link() returns, the operation has
completed and will not disappear. Or, at least, any event that could
cause it to disappear will be sufficiently severe that reliable mail
delivery will be fairly low on the list of concerns. Linux (and the
ext2 filesystem, in particular), however, performs directory
operations asynchronously; they are buffered like most other
filesystem operations. The result is better performance, at the cost
of an increase in hair-pulling and grumbling from MTA authors.
Said authors, who tend not to be quiet or reserved people, have been
fairly clear on how they feel about Linux's directory operation
semantics. There have been claims that the Single UNIX Specification
requires synchronous directory operations, but that appears to be
[18]an issue upon which reasonable people can differ.
The answer from the Linux developers has been that, if an application
needs a directory operation to be synchronous, it needs to ask for
those semantics explicitly. That can be done in several ways. One
could simply mount the filesystem with the sync option, but that is so
painfully slow that nobody is much interested in it. Another is to
request synchronous operations on the directory in question with the
ext2 chattr +S option. It works, but MTA authors seem not to like it,
perhaps because it makes all operations synchronous, even those which
do not need to be. Finally, an application can open the directory in
question and use fsync() to explicitly synchronize any outstanding
operations there.
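The last approach can be sketched roughly as follows. This is a minimal
illustration, not code from any MTA: the helper names fsync_directory and
durable_rename are made up for this example, and it assumes a Linux system
where a directory can be opened read-only and passed to fsync().

```python
import os
import tempfile


def fsync_directory(path):
    """Open a directory read-only and fsync() it to flush its pending entries."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(src, dst):
    """Rename src to dst, then sync every directory whose entries changed."""
    os.rename(src, dst)
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fsync_directory(dst_dir)
    if src_dir != dst_dir:
        fsync_directory(src_dir)


if __name__ == "__main__":
    # Hypothetical delivery step: write a message, sync it, then rename it.
    with tempfile.TemporaryDirectory() as spool:
        tmp_name = os.path.join(spool, "tmp.msg")
        final_name = os.path.join(spool, "new.msg")
        with open(tmp_name, "w") as f:
            f.write("example message\n")
            f.flush()
            os.fsync(f.fileno())  # file contents first
        durable_rename(tmp_name, final_name)  # then the directory entry
        print(os.listdir(spool))
```

Only the directories touched by the rename are synchronized, which is the
selectivity that mounting with sync or setting chattr +S cannot offer.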
The fsync() option seems like the best, since it lets the application
say when the synchronization must happen. But MTA authors grumble
again and some, at least, refuse to do it. The complaint is that it's
a special, nonportable coding requirement imposed only by Linux.
What people would like to see, it seems, is one or both of the
following:
* An fsync() operation on a file also synchronizes directory entries
belonging to that file. These semantics are difficult to implement
in the general sense - file names are distinct from the files
themselves, and a file can have more than one of them. Linus has
[19]pointed out a possible solution, however, that could work in
this particular situation.
* A new mount option, called something like dirsync, that would
cause directory operations to be synchronous. Nobody has posted a
patch to do this yet, but one may well be forthcoming.
This whole issue is a classic confrontation between groups of
developers with strong ideas of how things should be done. In the end,
however, Linux hackers want their platform to work well for mail
delivery, while MTA authors would be happy if their applications
worked properly on Linux. Some sort of solution should be achievable
here.
Who maintains the Linux sound drivers? While people toss in a patch
occasionally, it turns out that nobody is currently taking the role of
the maintainer of the Linux sound drivers, and the Open Sound System
(OSS) drivers in particular. Not much has changed with those drivers
in some time, and all of the serious sound hackers have been off
bashing on [20]ALSA for some time now.
ALSA is expected to replace OSS in the 2.5 development kernel. As a
result, one can detect a certain "why bother?" attitude in the air
when OSS maintenance is discussed. The fact remains, however, that OSS
will remain the standard sound driver in the 2.4 kernel; swapping in
ALSA would be too big a change for a stable kernel series. Even the
2.4 series. So somebody really should be keeping an eye on it for a
little while yet...
Chasing the virtual memory problems. Virtual memory performance in
2.4.x is still widely considered to be poor; it is, perhaps, the
single largest outstanding problem with the 2.4 series. The effort to
improve VM performance got some new energy this week when Ben LaHaise
[21]took a look at the problem. While Ben didn't actually nail down
any VM bugs himself, his work was crucial in directing the attention
of some of the other VM hackers - and Linus - to the right place.
While Linux may not be out of the VM woods yet, some real problems
have been found and fixed in the recent prepatches.
Ben's investigation showed that there were problems in how the kernel
throttles write requests. There was some code in place which attempted
to keep disk writes from overwhelming the system, but it did not work
quite as intended. Instead, it had the effect of allowing the write
queue(s) to grow to great lengths while, simultaneously, allowing an
aggressive writer to keep other processes from submitting I/O requests
for long periods of time. The long queues take up a lot of memory, of
course. They could also grow so long that even a very fast drive
needed, in some cases, multiple seconds to work through all of the
queued operations. An interactive process could thus find itself
unable to queue a request for some time, and then wait again for an
operation that ended up at the wrong end of a very long queue.
The solution involves a couple of separate tweaks:
* The old throttling code is [22]simply removed, since it created
fairness problems without actually solving the problem it was meant
to address.
* The maximum length of an I/O request queue is drastically reduced.
This reduces the maximum latency that any individual request
should experience, while, perhaps, reducing the effectiveness of
the elevator algorithm slightly. This change also moves write
throttling to the request allocation stage, which, it is hoped,
should solve that problem in a more fair and resource-efficient
manner.
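The link between queue length and latency can be made concrete with a
little arithmetic. The sketch below is only a back-of-the-envelope model:
it assumes requests are served strictly in order (the real elevator
reorders them), and both the per-request service time and the two queue
limits are invented numbers, not values taken from the kernel.

```python
def worst_case_wait_seconds(queue_length, service_time_ms):
    """Seconds a new request waits if every queued request is served first."""
    return queue_length * service_time_ms / 1000.0


if __name__ == "__main__":
    SERVICE_TIME_MS = 8  # assumed average time per request, not a measurement
    for limit in (1024, 128):  # assumed queue limits, for illustration only
        wait = worst_case_wait_seconds(limit, SERVICE_TIME_MS)
        print("queue limit %4d: up to %.3f s behind the queue" % (limit, wait))
```

Under these assumptions the worst-case wait scales linearly with the queue
limit, which is why a shorter queue bounds the delay seen by an interactive
process.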
There are also, as it turns out, some problems with how the 2.4 kernel
accounts for memory. Marcelo Tosatti has put in some patches to fix
how the amount of free memory in each zone is calculated. And Linus
[23]found a bug in how the kernel decided how much memory it could use
for I/O buffers. These problems, too, could allow the system to be
overwhelmed by write operations that really should have been
throttled.
Many of these fixes have [24]gone into 2.4.8pre4 and subsequent
releases; Alan Cox seems to be holding off on putting them into his
series at this point. There are some good initial reports, but more
testing (and more work) will certainly be required.
Rik van Riel, meanwhile, has [25]posted a patch which should make
2.4.8 much friendlier to systems without large amounts of swap space.
The current kernel, remember, keeps a page in swap even after it has
been paged back into main memory. There are certain performance
benefits to doing so, but systems with small swap areas can run out of
swap space easily. And a system that has run out of swap is not a
friendly place to work. Hopefully that problem is now a thing of the
past.
Buried in VMAs. The Linux kernel makes use of "virtual memory areas"
(VMAs) to keep track of the larger chunks of memory in use by any
process. One VMA is associated with one range of memory all using the
same source or backing store and the same access permissions. Thus,
for example, loading a shareable library will generally create at
least two VMAs: one for the library code, and one for its associated
data area.
For a relatively simple example of how VMAs are set up, type:

    cat /proc/self/maps

to see the VMAs used by the cat command itself.
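The same file can be read programmatically to count a process's VMAs. The
following is a small sketch, assuming the usual maps layout of address
range, permissions, offset, device, inode and optional path; the names
Vma, parse_maps and count_vmas are illustrative only.

```python
import os
from collections import namedtuple

Vma = namedtuple("Vma", "start end perms offset dev inode path")


def parse_maps(text):
    """Turn the text of a /proc/<pid>/maps file into a list of Vma tuples."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        addr, perms, offset, dev, inode = fields[:5]
        path = fields[5].strip() if len(fields) > 5 else ""
        start, end = (int(part, 16) for part in addr.split("-"))
        vmas.append(Vma(start, end, perms, int(offset, 16), dev, int(inode), path))
    return vmas


def count_vmas(pid="self"):
    """Count the VMAs of a process by reading its maps file (Linux only)."""
    with open("/proc/%s/maps" % pid) as f:
        return len(parse_maps(f.read()))


if __name__ == "__main__":
    if os.path.exists("/proc/self/maps"):
        print("this process has", count_vmas(), "VMAs")
```

Passing a numeric process ID instead of "self" gives the count for another
process, subject to the usual permission checks on /proc.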
There are reasons for wanting to keep the number of VMAs under
control. Each VMA requires a data structure in the kernel, so large
numbers of VMAs will take up a significant amount of kernel memory. It
is also often necessary to be able to find a specific VMA in a hurry.
For example, when a page fault occurs, the kernel must locate the VMA
describing the faulting address so that the fault can be resolved. The
VMA lookup routine is reasonably efficient, but performance will still
suffer if VMAs grow without bound. Normally there is no problem here;
the emacs process being used to type this text - which is not a small
process - has 53 virtual memory areas in use, which is a reasonable
number. Netscape uses 64 VMAs.
Recently, however, Chris Wedgwood [26]noticed that Mozilla was
running rather sluggishly. Yes, lots of Mozilla users notice that, but
this was a more severe case than usual. A quick look, via the handy
/proc interface, showed that the process had over 5,000 VMAs currently
mapped. That is more than enough to affect the performance of the
Mozilla process, and the system as a whole. Other GNOME applications,
such as evolution, show similar patterns.
Your editor runs Galeon, which, as everybody knows, is a much lighter
program. And, in fact, it is, as of this writing, running within a
svelte 1474 VMAs. Better, but still far too many. But the real
problem, as has been discussed on the kernel list, can be seen if you
look at [27]the actual VMA mappings. Here is an excerpt:
40c52000-40c5a000 rw-p 000bd000 00:00 0
40c5a000-40c61000 rw-p 000c5000 00:00 0
40c61000-40c69000 rw-p 000cc000 00:00 0
40c69000-40c71000 rw-p 000d4000 00:00 0
40c71000-40c74000 rw-p 000dc000 00:00 0
The pair of hexadecimal addresses on the left is the virtual address
range covered by each VMA. A quick look shows that most of Galeon's
VMAs are simple anonymous memory pages, and that they are contiguous.
In other words, they could be represented by a single VMA rather than
hundreds or thousands.
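That claim can be checked mechanically against the excerpt. The sketch
below is a user-space illustration only, not the kernel's merging logic:
it treats two entries as mergeable when they are adjacent, carry the same
permissions, device and inode, and have offsets that continue one another.
The kernel's real criteria are stricter and are not modeled here.

```python
EXCERPT = """\
40c52000-40c5a000 rw-p 000bd000 00:00 0
40c5a000-40c61000 rw-p 000c5000 00:00 0
40c61000-40c69000 rw-p 000cc000 00:00 0
40c69000-40c71000 rw-p 000d4000 00:00 0
40c71000-40c74000 rw-p 000dc000 00:00 0
"""


def parse_line(line):
    """Return (start, end, perms, offset, dev, inode) for one maps line."""
    addr, perms, offset, dev, inode = line.split()[:5]
    start, end = (int(part, 16) for part in addr.split("-"))
    return (start, end, perms, int(offset, 16), dev, int(inode))


def can_merge(a, b):
    """Simplified test: b directly follows a and has matching attributes."""
    a_start, a_end, a_perms, a_off, a_dev, a_inode = a
    b_start, b_end, b_perms, b_off, b_dev, b_inode = b
    return (a_end == b_start
            and (a_perms, a_dev, a_inode) == (b_perms, b_dev, b_inode)
            and a_off + (a_end - a_start) == b_off)


def coalesce(vmas):
    """Merge runs of mergeable entries into single address ranges."""
    merged = []
    for vma in sorted(vmas):
        if merged and can_merge(merged[-1], vma):
            prev = merged[-1]
            merged[-1] = (prev[0], vma[1]) + prev[2:]
        else:
            merged.append(vma)
    return merged


if __name__ == "__main__":
    entries = [parse_line(line) for line in EXCERPT.splitlines()]
    for start, end, perms, offset, dev, inode in coalesce(entries):
        print("%08x-%08x %s" % (start, end, perms))
```

Under this simplified rule, the five entries of the excerpt collapse into
one range running from 40c52000 to 40c74000.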
The Linux kernel makes an attempt to merge contiguous VMAs when it is
relatively easy to do. But the more comprehensive merging code found
in 2.2 has been abandoned, with the reasoning that (1) it is only
useful in very rare cases, and (2) it is extremely difficult to get
right. There is very little enthusiasm for thrashing up the VMA
merging code again without compelling evidence that it is really
necessary. Which means there is a need for an understanding of just
what is going on to cause this kind of behavior.
To this end, Mr. Wedgwood performed [28]a detailed analysis of the
system call pattern that brings about the explosion of VMAs. The
problem, it seems, is with the malloc() implementation in the C
library, which plays some tricky and complicated games with memory
allocation. In particular, it does a lot of memory mapping, followed
by partial unmapping for alignment purposes, and, crucially, changes
to memory protection as segments of memory are parceled out.
The C library plays with protections, presumably, in an attempt to
catch overruns of allocated memory. But, if you change the protection
on a subsection of a VMA, that VMA must be split into independently
protected VMAs: two, or three if the subsection falls in the middle.
When the kernel does this split, it
could attempt to merge the newly protected VMA with those next to it,
but currently does not. The result is, for certain memory allocation
patterns, lots of VMAs.
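A toy model shows how quickly the count can grow when splits are never
followed by a merge. Everything here is an assumption made for
illustration: VMAs are reduced to (start, end, protection) tuples,
mprotect_model is a made-up function rather than a kernel or C library
interface, and the page-at-a-time pattern is a simplification, not the
sequence of calls found in the analysis.

```python
PAGE = 4096


def mprotect_model(vmas, start, end, prot):
    """Change protection on [start, end), splitting VMAs but never merging."""
    result = []
    for v_start, v_end, v_prot in vmas:
        untouched = v_end <= start or v_start >= end or v_prot == prot
        if untouched:
            result.append((v_start, v_end, v_prot))
            continue
        if v_start < start:  # piece before the changed range keeps old prot
            result.append((v_start, start, v_prot))
        result.append((max(v_start, start), min(v_end, end), prot))
        if v_end > end:  # piece after the changed range keeps old prot
            result.append((end, v_end, v_prot))
    return result


def parcel_out(pages):
    """Start with one inaccessible region and enable it one page at a time."""
    vmas = [(0, pages * PAGE, "---")]
    for i in range(pages):
        vmas = mprotect_model(vmas, i * PAGE, (i + 1) * PAGE, "rw-")
    return vmas


if __name__ == "__main__":
    final = parcel_out(16)
    print(len(final), "VMAs, all contiguous with identical protection")
```

In this model every entry ends up adjacent to its neighbor with the same
protection, so a merge step after each change would leave a single VMA.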
It's possible that a patch will emerge which makes mprotect() perform
VMA merging. But there appears to also be a certain inclination among
the kernel hackers to blame the problem on the C library and forget
about it. Relations across the kernel-glibc divide are not always the
best, and it is precisely this sort of issue that can create
disagreements. But, until one side or the other makes a change, some
applications are going to run sluggishly under 2.4.
Other patches and updates released this week include:
* Alexander Viro decided he was tired of waiting and [29]submitted a
patch fixing a race condition in devfs. Richard Gooch [30]didn't
like the fix. What followed started at the name-calling level, but
then evolved into a productive technical discussion. One result is
new [31]devfs and [32]devfsd releases from Richard; expect more in
the near future.
* The [33]first release of the 2.5 kernel build system has been
announced by Keith Owens. See the announcement for a detailed
description of this release.
* Also from Keith: [34]a proposal to change the way /proc/ksyms
works on the IA64 architecture (and, presumably, others that use
function descriptors).
* Richard Gooch has [35]a new version of his patch which allows the
2.4 kernel (with devfs) to support up to 2144 SCSI devices.
* Matthew Macleod has [36]posted a version of the international
crypto patch for 2.4.7. Jari Ruusu, meanwhile, has released
[37]loop-AES-v1.3d, which is just the file encryption part of the
international crypto patch.
* A new Compaq Hotplug PCI driver was [38]released by Greg
Kroah-Hartman.
* IBM has released [39]version 1.0.2 of its journaling filesystem.
* Etienne Lorrain has [40]announced version 0.4 of his "Gujin"
bootloader.
* Alexander Viro has implemented [41]a general parser for mount
options which, he hopes, will help to generalize and clean up the
option handling in the various filesystems supported by Linux.
* Mike Kravetz and associates have posted [42]a scalable scheduler
patch which addresses some of the scheduling problems seen on
larger systems (see [43]our OLS coverage for details). Linus
[44]didn't like the patch, but his objections had more to do with
coding style than the actual changes made. A new version should be
forthcoming soon.
* [45]A new security module patch has been posted by Greg
Kroah-Hartman.
* Andreas Gruenbacher has released [46]version 0.7.15 of the access
control list (ACL) patch.
* HP has released [47]version 0.8 of the HP OfficeJet driver.
Section Editor: [48]Jonathan Corbet
August 9, 2001
For other kernel news, see:
* [49]Kernel traffic
* [50]Kernel Newsflash
* [51]Kernel Trap
Other resources:
* [52]Kernel Source Reference
* [53]L-K mailing list FAQ
* [54]Linux-MM
* [55]Linux Scalability Effort
* [56]Kernel Newbies
* [57]Linux Device Drivers
[59]Eklektix, Inc. Linux powered! Copyright © 2001 [60]Eklektix, Inc.,
all rights reserved
Linux (R) is a registered trademark of Linus Torvalds
References
1. http://lwn.net/
2. http://ads.tucows.com/click.ng/pageid=001-012-132-000-000-003-000-000-012
3. http://lwn.net/2001/0809/
4. http://lwn.net/2001/0809/security.php3
5. http://lwn.net/2001/0809/dists.php3
6. http://lwn.net/2001/0809/desktop.php3
7. http://lwn.net/2001/0809/devel.php3
8. http://lwn.net/2001/0809/commerce.php3
9. http://lwn.net/2001/0809/press.php3
10. http://lwn.net/2001/0809/announce.php3
11. http://lwn.net/2001/0809/history.php3
12. http://lwn.net/2001/0809/letters.php3
13. http://lwn.net/2001/0809/bigpage.php3
14. http://lwn.net/2001/0802/kernel.php3
15. http://lwn.net/2001/0809/a/2.4.8pre6.php3
16. http://lwn.net/2001/0809/a/2.4.7ac10.php3
17. http://lwn.net/2001/0809/a/2.4.7ac4.php3
18. http://lwn.net/2001/0809/a/sus.php3
19. http://lwn.net/2001/0809/a/lt-fsync.php3
20. http://www.alsa-project.org/
21. http://lwn.net/2001/0809/a/bcrl-vm.php3
22. http://lwn.net/2001/0809/a/lt-ll_rw_block.php3
23. http://lwn.net/2001/0809/a/lt-zone.php3
24. http://lwn.net/2001/0809/a/lt-pre4-vm.php3
25. http://lwn.net/2001/0809/a/rvr-swap.php3
26. http://lwn.net/2001/0809/a/mozilla-vmas.php3
27. http://lwn.net/2001/0809/a/galeon-vmas.php3
28. http://lwn.net/2001/0809/a/vma-analysis.php3
29. http://lwn.net/2001/0809/a/devfs-race-fix.php3
30. http://lwn.net/2001/0809/a/rg-race-fix.php3
31. http://lwn.net/2001/0809/a/devfs.php3
32. http://lwn.net/2001/0809/a/devfsd.php3
33. http://lwn.net/2001/0809/a/kbuild.php3
34. http://lwn.net/2001/0809/a/ia64-ksyms.php3
35. http://lwn.net/2001/0809/a/lotsa-scsi.php3
36. http://lwn.net/2001/0809/a/crypto.php3
37. http://lwn.net/2001/0809/a/file-crypto.php3
38. http://lwn.net/2001/0809/a/compaq-hotplug.php3
39. http://lwn.net/2001/0809/a/jfs.php3
40. http://lwn.net/2001/0809/a/gujin.php3
41. http://lwn.net/2001/0809/a/mount-parser.php3
42. http://lwn.net/2001/0809/a/scheduler.php3
43. http://lwn.net/2001/features/OLS/
44. http://lwn.net/2001/0809/a/lt-scheduler.php3
45. http://lwn.net/2001/0809/a/sm.php3
46. http://lwn.net/2001/0809/a/acl.php3
47. http://lwn.net/2001/0809/a/hpoj.php3
48. mailto:lwn@lwn.net
49. http://kt.zork.net/
50. http://www.atnf.csiro.au/~rgooch/linux/docs/kernel-newsflash.html
51. http://www.kerneltrap.com/
52. http://lksr.org/
53. http://www.tux.org/lkml/
54. http://www.linux.eu.org/Linux-MM/
55. http://lse.sourceforge.net/
56. http://www.kernelnewbies.org/
57. http://www.xml.com/ldd/chapter/book/index.html
58. http://lwn.net/2001/0809/dists.php3
59. http://www.eklektix.com/
60. http://www.eklektix.com/
--- ifmail v.2.14.os7-aks1
* Origin: Unknown (2:4615/71.10@fidonet)