 - RU.WEBSITE -------------------------------------------------------------------
 From : Vladislav Zlobin                     2:5011/13.33   06 Feb 2001  16:31:38
 To : Andrey Kr.
 Subject : The robot. Who is it? :)
 -------------------------------------------------------------------------------- 
 
 
 On 06/Feb/01 at 13:22 you wrote:
 
  AK> A question: how do I write a robot.txt file correctly?
 
                       Web Server Administrator's Guide
                       to the Robots Exclusion Protocol
                                       
    This guide is aimed at Web Server Administrators who want to use the
    Robots Exclusion Protocol.
    
    Note that this is not a specification -- for details and formal syntax
    and definition see the specification.
    
 Introduction
 
    The Robots Exclusion Protocol is very straightforward. In a nutshell
    it works like this:
    
    When a compliant Web Robot visits a site, it first checks for a
    "/robots.txt" URL on the site. If this URL exists, the Robot parses
    its contents for directives that instruct the robot not to visit
    certain parts of the site.
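
    For illustration only, this check can be sketched with Python's
    standard urllib.robotparser module; the "MyRobot" agent string and
    the page URL below are placeholders, not part of the protocol:

 from urllib import robotparser

 # Point the parser at the site's single well-known robots.txt URL,
 # fetch it, and ask whether a given page may be retrieved.
 rp = robotparser.RobotFileParser()
 rp.set_url("http://www.w3.org/robots.txt")
 rp.read()
 print(rp.can_fetch("MyRobot", "http://www.w3.org/tmp/page.html"))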
    
    As a Web Server Administrator you can create directives that make
    sense for your site. This page tells you how.
    
 Where to create the robots.txt file
 
    The Robot will simply look for a "/robots.txt" URL on your site, where
    a site is defined as an HTTP server running on a particular host and
    port number. For example:
    
              Site URL           Corresponding Robots.txt URL
      http://www.w3.org/      http://www.w3.org/robots.txt
      http://www.w3.org:80/   http://www.w3.org:80/robots.txt
      http://www.w3.org:1234/ http://www.w3.org:1234/robots.txt
      http://w3.org/          http://w3.org/robots.txt
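
    As a sketch of the same mapping, the robots.txt URL can be derived
    from any site URL by keeping the scheme, host and port and replacing
    the path. The example below uses Python's urllib.parse; the page URL
    is just a placeholder:

 from urllib.parse import urlsplit, urlunsplit

 def robots_url(site_url):
     # Keep scheme and host:port, replace the whole path with /robots.txt
     parts = urlsplit(site_url)
     return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

 print(robots_url("http://www.w3.org:1234/some/page.html"))
 # -> http://www.w3.org:1234/robots.txt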
    
    Note that there can only be a single "/robots.txt" on a site.
    Specifically, you should not put "robots.txt" files in user
    directories, because a robot will never look at them. If you want your
    users to be able to create their own "robots.txt", you will need to
    merge them all into a single "/robots.txt". If you don't want to do
    this, your users might want to use the Robots META Tag instead.
    
    Also, remember that URLs are case sensitive, and "/robots.txt" must be
    all lower-case.
    
           Pointless robots.txt URLs
      http://www.w3.org/admin/robots.txt
      http://www.w3.org/~timbl/robots.txt
      ftp://ftp.w3.com/robots.txt
    
    So, you need to provide the "/robots.txt" in the top-level of your URL
    space. How to do this depends on your particular server software and
    configuration.
    
    For most servers it means creating a file in your top-level server
    directory. On a UNIX machine this might be
    /usr/local/etc/httpd/htdocs/robots.txt
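
    If you prefer to generate the file from a script, a minimal sketch
    in Python might look like the following; the document root is the
    example path above and will differ on your server, and the rules
    are only a placeholder (see the next section for what to put in):

 from pathlib import Path

 # Example document root -- adjust to your server's top-level directory.
 docroot = Path("/usr/local/etc/httpd/htdocs")
 rules = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n"
 (docroot / "robots.txt").write_text(rules)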
    
 What to put into the robots.txt file
 
    The "/robots.txt" file usually contains a record looking like this:
    
 User-agent: *
 Disallow: /cgi-bin/
 Disallow: /tmp/
 Disallow: /~joe/
 
    In this example, three directories are excluded.
    
    Note that you need a separate "Disallow" line for every URL prefix you
    want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also,
    you may not have blank lines in a record, as they are used to delimit
    multiple records.
    
    Note also that regular expressions are not supported in either the
    User-agent or Disallow lines. The '*' in the User-agent field is a
    special value meaning "any robot". Specifically, you cannot have lines
    like "Disallow: /tmp/*" or "Disallow: *.gif".
    
    What you want to exclude depends on your server. Everything not
    explicitly disallowed is considered fair game to retrieve. Here follow
    some examples:
    
   To exclude all robots from the entire server
   
 User-agent: *
 Disallow: /
 
   To allow all robots complete access
   
 User-agent: *
 Disallow:
 
    Or create an empty "/robots.txt" file.
    
   To exclude all robots from part of the server
   
 User-agent: *
 Disallow: /cgi-bin/
 Disallow: /tmp/
 Disallow: /private/
 
   To exclude a single robot
   
 User-agent: BadBot
 Disallow: /
 
   To allow a single robot
   
 User-agent: WebCrawler
 Disallow:
 
 User-agent: *
 Disallow: /
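
    A record that names the robot takes precedence over the "*" record.
    As a sketch, Python's urllib.robotparser resolves the two records
    above like this (the page URL is a placeholder):

 from urllib import robotparser

 rp = robotparser.RobotFileParser()
 rp.parse(["User-agent: WebCrawler", "Disallow:", "",
           "User-agent: *", "Disallow: /"])

 print(rp.can_fetch("WebCrawler", "http://www.w3.org/index.html"))  # True
 print(rp.can_fetch("BadBot", "http://www.w3.org/index.html"))      # False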
 
   To exclude all files except one
   
    This is currently a bit awkward, as there is no "Allow" field. The
    easy way is to put all files to be disallowed into a separate
    directory, say "docs", and leave the one file in the level above this
    directory:
    
 User-agent: *
 Disallow: /~joe/docs/
 
    Alternatively, you can explicitly disallow every page that robots
    should not retrieve:
    
 User-agent: *
 Disallow: /~joe/private.html
 Disallow: /~joe/foo.html
 Disallow: /~joe/bar.html
 
 ---------------------------------------
 
     The specification mentioned in the text is available at
 http://info.webcrawler.com/mak/projects/robots/norobots.html
 
 /SCoon
 
 --- добрый доктор v0.43i/W32
  * Origin: NlC (2:5011/13.33)
 
 
