 - RU.LINUX ---------------------------------------------------------------------
 From : Sergey Lentsov                       2:4615/71.10   13 Aug 2001  17:10:24
 To : All
 Subject : URL: http://www.lwn.net/2001/0809/kernel.php3
 -------------------------------------------------------------------------------- 
 
    See also: [14]last week's Kernel page.
    
 Kernel development
 
    The current kernel release is still 2.4.7. Linus released 2.4.8pre7 on
    August 7, but, as of this writing, the changelog had not been updated
    from [15]the pre6 version. It contains more VM fixes (see below) and a
    number of other updates.
    
    Alan Cox's latest is [16]2.4.7ac10. It contains a vast number of fixes
    and updates, but the most interesting part may be the merging of the
    ext3 journaling filesystem, which happened in [17]2.4.7ac4. While ext3
    will likely not find its way into Linus's kernel for some time yet,
    its presence in the "ac" series is a firm step in that direction.
    
    On synchronous directory operations. If an application renames (or
    makes a link to) a file, how can it know when that operation has found
    its way to the physical device and will not disappear if the system
    crashes? Most applications are not too concerned about such issues, as
    long as their operations make it to persistent storage eventually. But
    there are exceptions. In particular, a number of mail transfer agents
    (MTAs), such as Postfix and qmail, depend heavily on link and rename
    operations for reliable delivery of mail. They need to know when these
    operations have completed.
    
    Many Unix-like systems, it seems, implement directory operations like
    link() in a synchronous manner. When link() returns, the operation has
    completed and will not disappear. Or, at least, any event that could
    cause it to disappear will be sufficiently severe that reliable mail
    delivery will be fairly low on the list of concerns. Linux (and the
    ext2 filesystem, in particular), however, performs directory
    operations asynchronously; they are buffered like most other
    filesystem operations. The result is better performance, at the cost
    of an increase in hair-pulling and grumbling from MTA authors.
    
    Said authors, who tend not to be quiet or reserved people, have been
    fairly clear on how they feel about Linux's directory operation
    semantics. There have been claims that the Single UNIX Specification
    requires synchronous directory operations, but that appears to be
    [18]an issue upon which reasonable people can differ.
    
    The answer from the Linux developers has been that, if an application
    needs a directory operation to be synchronous, it needs to ask for
    those semantics explicitly. That can be done in several ways. One
    could simply mount the filesystem with the sync option, but that is so
    painfully slow that nobody is much interested in it. Another is to
    request synchronous operations on the directory in question with the
    ext2 chattr +S option. It works, but MTA authors seem not to like it,
    perhaps because it makes all operations synchronous, even those which
    do not need to be. Finally, an application can open the directory in
    question and use fsync() to explicitly synchronize any outstanding
    operations there.
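    
    As a minimal sketch of that last approach (an illustration, not code
    taken from any MTA), the Python fragment below writes a message to a
    temporary name, renames it into place, and then calls fsync() on the
    containing directory. The function names, the ".tmp" suffix, and the
    spool layout are all hypothetical.

```python
import os
import tempfile


def durable_rename(src: str, dst: str) -> None:
    """Rename src to dst, then fsync the directory that holds dst.

    Assumes src and dst live in the same directory; a rename across
    directories would need an fsync on both of them.
    """
    os.rename(src, dst)
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # ask the kernel to write out the directory change
    finally:
        os.close(dir_fd)


def deliver(spool_dir: str, name: str, payload: bytes) -> str:
    """Write payload under a temporary name, then rename it into place."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # file contents first
    durable_rename(tmp_path, final_path)  # then the directory entry
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        print(deliver(spool, "msg-0001", b"Subject: test\n\nhello\n"))
```

    The cost is one extra open/fsync/close per delivery, paid only at
    the points where the application actually needs the guarantee.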
    
    The fsync() option seems like the best, since it lets the application
    say when the synchronization must happen. But MTA authors grumble
    again and some, at least, refuse to do it. The complaint is that it's
    a special, nonportable coding requirement imposed only by Linux.
    
    What people would like to see, it seems, is one or both of the
    following:
      * An fsync() operation on a file also synchronizes directory entries
        belonging to that file. These semantics are difficult to implement
        in the general sense - file names are distinct from the files
        themselves, and a file can have more than one of them. Linus has
        [19]pointed out a possible solution, however, that could work in
        this particular situation.
      * A new mount option, called something like dirsync, that would
        cause directory operations to be synchronous. Nobody has posted a
        patch to do this yet, but one may well be forthcoming.
        
    This whole issue is a classic confrontation between groups of
    developers with strong ideas of how things should be done. In the end,
    however, Linux hackers want their platform to work well for mail
    delivery, while MTA authors would be happy if their applications
    worked properly on Linux. Some sort of solution should be achievable
    here.
    
    Who maintains the Linux sound drivers? While people toss in a patch
    occasionally, it turns out that nobody is currently taking the role of
    the maintainer of the Linux sound drivers, and the Open Sound System
    (OSS) drivers in particular. Not much has changed with those drivers
    in some time, and all of the serious sound hackers have been off
    bashing on [20]ALSA for some time now.
    
    ALSA is expected to replace OSS in the 2.5 development kernel. As a
    result, one can detect a certain "why bother?" attitude in the air
    when OSS maintenance is discussed. The fact remains, however, that OSS
    will remain the standard sound driver in the 2.4 kernel; swapping in
    ALSA would be too big a change for a stable kernel series. Even the
    2.4 series. So somebody really should be keeping an eye on it for a
    little while yet...
    
    Chasing the virtual memory problems. Virtual memory performance in
    2.4.x is still widely considered to be poor; it is, perhaps, the
    single largest outstanding problem with the 2.4 series. The effort to
    improve VM performance got some new energy this week when Ben LaHaise
    [21]took a look at the problem. While Ben didn't actually nail down
    any VM bugs himself, his work was crucial in directing the attention
    of some of the other VM hackers - and Linus - to the right place.
    While Linux may not be out of the VM woods yet, some real problems
    have been found and fixed in the recent prepatches.
    
    Ben's investigation showed that there were problems in how the kernel
    throttles write requests. There was some code in place which attempted
    to keep disk writes from overwhelming the system, but it did not work
    quite as intended. Instead, it had the effect of allowing the write
    queue(s) to grow to great lengths while, simultaneously, allowing an
    aggressive writer to keep other processes from submitting I/O requests
    for long periods of time. The long queues take up a lot of memory, of
    course. They also could reach a length where even a very fast drive
    would need, in some cases, multiple seconds to perform all of the
    queued operations. An interactive process could thus find itself unable
    to queue a request for some time, then waiting, again, for an
    operation that ended up at the wrong end of a very long queue.
    
    The solution involves a couple of separate tweaks:
      * The old throttling code is [22]simply removed, since it created
        fairness problems without actually keeping writes from
        overwhelming the system.
      * The maximum length of an I/O request queue is drastically reduced.
        This reduces the maximum latency that any individual request
        should experience, while, perhaps, reducing the effectiveness of
        the elevator algorithm slightly. This change also moves write
        throttling to the request allocation stage, which, it is hoped,
        should solve that problem in a more fair and resource-efficient
        manner.
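        
    To put rough numbers on the latency argument, the sketch below
    estimates how long a request entering at the tail of a full queue
    waits before it is serviced. The queue lengths and the 5 ms
    per-request service time are made-up values for illustration, and the
    strict first-in, first-out assumption ignores any reordering done by
    the elevator.

```python
def worst_case_wait_seconds(queue_length: int, service_time_ms: float) -> float:
    """Wait for a request queued behind queue_length others (strict FIFO)."""
    return queue_length * service_time_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical queue limits: a long queue versus a drastically reduced one.
    for length in (4096, 128):
        wait = worst_case_wait_seconds(length, service_time_ms=5.0)
        print(f"queue of {length:5d} requests: up to {wait:.2f} s")
```

    Under these assumptions the bound shrinks in direct proportion to
    the queue limit, which is the effect the second change is after.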
        
    There are also, as it turns out, some problems with how the 2.4 kernel
    accounts for memory. Marcelo Tosatti has put in some patches to fix
    how the amount of free memory in each zone is calculated. And Linus
    [23]found a bug in how the kernel decided how much memory it could use
    for I/O buffers. These problems, too, could allow the system to be
    overwhelmed by write operations that really should have been
    throttled.
    
    Many of these fixes have [24]gone into 2.4.8pre4 and subsequent
    releases; Alan Cox seems to be holding off on putting them into his
    series at this point. There are some good initial reports, but more
    testing (and more work) will certainly be required.
    
    Rik van Riel, meanwhile, has [25]posted a patch which should make
    2.4.8 much friendlier to systems without large amounts of swap space.
    The current kernel, remember, keeps a page in swap even after it has
    been paged back into main memory. There are certain performance
    benefits to doing so, but systems with small swap areas can run out of
    swap space easily. And a system that has run out of swap is not a
    friendly place to work. Hopefully that problem is now a thing of the
    past.
    
    Buried in VMAs. The Linux kernel makes use of "virtual memory areas"
    (VMAs) to keep track of the larger chunks of memory in use by any
    process. One VMA is associated with one range of memory all using the
    same source or backing store and the same access permissions. Thus,
    for example, loading a shareable library will generally create at
    least two VMAs: one for the library code, and one for its associated
    data area.
    
    For a relatively simple example of how VMAs are set up, type:
   cat /proc/self/maps
 
    to see the VMAs used by the cat command itself.
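    
    For a process with more than a handful of mappings, counting is
    easier than reading. The following is a small sketch that parses the
    /proc/<pid>/maps format (address range, permissions, offset, device,
    inode, optional path) and counts the VMAs per backing file. The
    function names are hypothetical, and it assumes a Linux system with
    /proc mounted.

```python
from collections import Counter


def parse_maps(text):
    """Return a list of (start, end, perms, path) tuples, one per VMA."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append((int(start_s, 16), int(end_s, 16), fields[1], path))
    return vmas


def summarize(pid="self"):
    """Return (total VMA count, Counter of VMAs per backing path)."""
    with open(f"/proc/{pid}/maps") as f:
        vmas = parse_maps(f.read())
    by_path = Counter(path or "[anonymous]" for _, _, _, path in vmas)
    return len(vmas), by_path
```

    Calling summarize() with the process ID of a running browser gives
    the sort of totals quoted in this article.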
    
    There are reasons for wanting to keep the number of VMAs under
    control. Each VMA requires a data structure in the kernel, so large
    numbers of VMAs will take up a significant amount of kernel memory. It
    is also often necessary to be able to find a specific VMA in a hurry.
    For example, when a page fault occurs, the kernel must locate the VMA
    describing the faulting address so that the fault can be resolved. The
    VMA lookup routine is reasonably efficient, but performance will still
    suffer if VMAs grow without bound. Normally there is no problem here;
    the emacs process being used to type this text - which is not a small
    process - has 53 virtual memory areas in use, which is a reasonable
    number. Netscape uses 64 VMAs.
    
    Recently, however, Chris Wedgwood [26]noticed that Mozilla was
    running rather sluggishly. Yes, lots of Mozilla users notice that, but
    this was a more severe case than usual. A quick look, via the handy
    /proc interface, showed that the process had over 5,000 VMAs currently
    mapped. That is more than enough to affect the performance of the
    Mozilla process, and the system as a whole. Other GNOME applications,
    such as Evolution, show similar patterns.
    
    Your editor runs Galeon, which, as everybody knows, is a much lighter
    program. And, in fact, it is, as of this writing, running within a
    svelte 1474 VMAs. Better, but still far too many. But the real
    problem, as has been discussed on the kernel list, can be seen if you
    look at [27]the actual VMA mappings. Here is an excerpt:
   40c52000-40c5a000 rw-p 000bd000 00:00 0
   40c5a000-40c61000 rw-p 000c5000 00:00 0
   40c61000-40c69000 rw-p 000cc000 00:00 0
   40c69000-40c71000 rw-p 000d4000 00:00 0
   40c71000-40c74000 rw-p 000dc000 00:00 0
 
    The pair of hexadecimal addresses on the left is the virtual address
    range covered by each VMA. A quick look shows that most of Galeon's
    VMAs are simple anonymous memory pages, and that they are contiguous.
    In other words, they could be represented by a single VMA rather than
    hundreds or thousands.
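    
    The claim is easy to check mechanically. The sketch below takes the
    five lines of the excerpt and coalesces neighbours that are anonymous
    (device 00:00, inode 0), carry the same permissions, and touch end to
    start. It is a user-space illustration only, not the kernel's merging
    logic, and it deliberately ignores the offset column.

```python
EXCERPT = """\
40c52000-40c5a000 rw-p 000bd000 00:00 0
40c5a000-40c61000 rw-p 000c5000 00:00 0
40c61000-40c69000 rw-p 000cc000 00:00 0
40c69000-40c71000 rw-p 000d4000 00:00 0
40c71000-40c74000 rw-p 000dc000 00:00 0
"""


def merge_contiguous(lines):
    """Coalesce adjacent anonymous mappings with identical permissions."""
    merged = []  # entries are [start, end, (perms, device, inode)]
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        key = (fields[1], fields[3], fields[4])
        anonymous = fields[3] == "00:00" and fields[4] == "0"
        if merged and anonymous and merged[-1][2] == key and merged[-1][1] == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, key])
    return [(s, e) for s, e, _ in merged]


if __name__ == "__main__":
    for start, end in merge_contiguous(EXCERPT.splitlines()):
        print(f"{start:08x}-{end:08x}")  # prints 40c52000-40c74000
```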
    
    The Linux kernel makes an attempt to merge contiguous VMAs when it is
    relatively easy to do. But the more comprehensive merging code found in
    2.2 has been abandoned, with the reasoning that (1) it is only
    useful in very rare cases, and (2) it is extremely difficult to get
    right. There is very little enthusiasm for thrashing up the VMA
    merging code again without compelling evidence that it is really
    necessary. Which means there is a need for an understanding of just
    what is going on to cause this kind of behavior.
    
    To this end, Mr. Wedgwood performed [28]a detailed analysis of the
    system call pattern that brings about the explosion of VMAs. The
    problem, it seems, is with the malloc() implementation in the C
    library, which plays some tricky and complicated games with memory
    allocation. In particular, it does a lot of memory mapping, followed
    by partial unmapping for alignment purposes, and, crucially, changes
    to memory protection as segments of memory are parceled out.
    
    The C library plays with protections, presumably, in an attempt to
    catch overruns of allocated memory. But, if you change the protection
    on a subsection of a VMA, that VMA must be split into two,
    independently protected VMAs. When the kernel does this split, it
    could attempt to merge the newly protected VMA with those next to it,
    but currently does not. The result is, for certain memory allocation
    patterns, lots of VMAs.
    
    It's possible that a patch will emerge which makes mprotect() perform
    VMA merging. But there also appears to be a certain inclination among
    the kernel hackers to blame the problem on the C library and forget
    about it. Relations across the kernel-glibc divide are not always the
    best, and it is precisely this sort of issue that can create
    disagreements. But, until one side or the other makes a change, some
    applications are going to run sluggishly under 2.4.
    
    Other patches and updates released this week include:
    
      * Alexander Viro decided he was tired of waiting and [29]submitted a
        patch fixing a race condition in devfs. Richard Gooch [30]didn't
        like the fix. What followed started at the name-calling level, but
        then evolved into a productive technical discussion. One result is
        new [31]devfs and [32]devfsd releases from Richard; expect more in
        the near future.
      * The [33]first release of the 2.5 kernel build system has been
        announced by Keith Owens. See the announcement for a detailed
        description of this release.
      * Also from Keith: [34]a proposal to change the way /proc/ksyms
        works on the IA64 architecture (and, presumably, others that use
        function descriptors).
      * Richard Gooch has [35]a new version of his patch which allows the
        2.4 kernel (with devfs) to support up to 2144 SCSI devices.
      * Matthew Macleod has [36]posted a version of the international
        crypto patch for 2.4.7. Jari Ruusu, meanwhile, has released
        [37]loop-AES-v1.3d, which is just the file encryption part of the
        international crypto patch.
      * A new Compaq Hotplug PCI driver was [38]released by Greg
        Kroah-Hartman.
      * IBM has released [39]version 1.0.2 of its journaling filesystem.
      * Etienne Lorrain has [40]announced version 0.4 of his "Gujin"
        bootloader.
      * Alexander Viro has implemented [41]a general parser for mount
        options which, he hopes, will help to generalize and clean up the
        option handling in the various filesystems supported by Linux.
      * Mike Kravetz and associates have posted [42]a scalable scheduler
        patch which addresses some of the scheduling problems seen on
        larger systems (see [43]our OLS coverage for details). Linus
        [44]didn't like the patch, but his objections had more to do with
        coding style than the actual changes made. A new version should be
        forthcoming soon.
      * [45]A new security module patch has been posted by Greg
        Kroah-Hartman.
      * Andreas Gruenbacher has released [46]version 0.7.15 of the access
        control list (ACL) patch.
      * HP has released [47]version 0.8 of the HP OfficeJet driver.
        
    Section Editor: [48]Jonathan Corbet
    August 9, 2001
    
    For other kernel news, see:
      * [49]Kernel traffic
      * [50]Kernel Newsflash
      * [51]Kernel Trap
    
    Other resources:
      * [52]Kernel Source Reference
      * [53]L-K mailing list FAQ
      * [54]Linux-MM
      * [55]Linux Scalability Effort
      * [56]Kernel Newbies
      * [57]Linux Device Drivers
    
    
    
    
    [59]Eklektix, Inc. Linux powered! Copyright © 2001 [60]Eklektix, Inc.,
    all rights reserved
    Linux (R) is a registered trademark of Linus Torvalds
 
 References
 
    1. http://lwn.net/
    2. http://ads.tucows.com/click.ng/pageid=001-012-132-000-000-003-000-000-012
    3. http://lwn.net/2001/0809/
    4. http://lwn.net/2001/0809/security.php3
    5. http://lwn.net/2001/0809/dists.php3
    6. http://lwn.net/2001/0809/desktop.php3
    7. http://lwn.net/2001/0809/devel.php3
    8. http://lwn.net/2001/0809/commerce.php3
    9. http://lwn.net/2001/0809/press.php3
   10. http://lwn.net/2001/0809/announce.php3
   11. http://lwn.net/2001/0809/history.php3
   12. http://lwn.net/2001/0809/letters.php3
   13. http://lwn.net/2001/0809/bigpage.php3
   14. http://lwn.net/2001/0802/kernel.php3
   15. http://lwn.net/2001/0809/a/2.4.8pre6.php3
   16. http://lwn.net/2001/0809/a/2.4.7ac10.php3
   17. http://lwn.net/2001/0809/a/2.4.7ac4.php3
   18. http://lwn.net/2001/0809/a/sus.php3
   19. http://lwn.net/2001/0809/a/lt-fsync.php3
   20. http://www.alsa-project.org/
   21. http://lwn.net/2001/0809/a/bcrl-vm.php3
   22. http://lwn.net/2001/0809/a/lt-ll_rw_block.php3
   23. http://lwn.net/2001/0809/a/lt-zone.php3
   24. http://lwn.net/2001/0809/a/lt-pre4-vm.php3
   25. http://lwn.net/2001/0809/a/rvr-swap.php3
   26. http://lwn.net/2001/0809/a/mozilla-vmas.php3
   27. http://lwn.net/2001/0809/a/galeon-vmas.php3
   28. http://lwn.net/2001/0809/a/vma-analysis.php3
   29. http://lwn.net/2001/0809/a/devfs-race-fix.php3
   30. http://lwn.net/2001/0809/a/rg-race-fix.php3
   31. http://lwn.net/2001/0809/a/devfs.php3
   32. http://lwn.net/2001/0809/a/devfsd.php3
   33. http://lwn.net/2001/0809/a/kbuild.php3
   34. http://lwn.net/2001/0809/a/ia64-ksyms.php3
   35. http://lwn.net/2001/0809/a/lotsa-scsi.php3
   36. http://lwn.net/2001/0809/a/crypto.php3
   37. http://lwn.net/2001/0809/a/file-crypto.php3
   38. http://lwn.net/2001/0809/a/compaq-hotplug.php3
   39. http://lwn.net/2001/0809/a/jfs.php3
   40. http://lwn.net/2001/0809/a/gujin.php3
   41. http://lwn.net/2001/0809/a/mount-parser.php3
   42. http://lwn.net/2001/0809/a/scheduler.php3
   43. http://lwn.net/2001/features/OLS/
   44. http://lwn.net/2001/0809/a/lt-scheduler.php3
   45. http://lwn.net/2001/0809/a/sm.php3
   46. http://lwn.net/2001/0809/a/acl.php3
   47. http://lwn.net/2001/0809/a/hpoj.php3
   48. mailto:lwn@lwn.net
   49. http://kt.zork.net/
   50. http://www.atnf.csiro.au/~rgooch/linux/docs/kernel-newsflash.html
   51. http://www.kerneltrap.com/
   52. http://lksr.org/
   53. http://www.tux.org/lkml/
   54. http://www.linux.eu.org/Linux-MM/
   55. http://lse.sourceforge.net/
   56. http://www.kernelnewbies.org/
   57. http://www.xml.com/ldd/chapter/book/index.html
   58. http://lwn.net/2001/0809/dists.php3
   59. http://www.eklektix.com/
   60. http://www.eklektix.com/
 
 --- ifmail v.2.14.os7-aks1
  * Origin: Unknown (2:4615/71.10@fidonet)
 
 
