A robots.txt file places restrictions on the search engine robots (known as "bots") that crawl the web. These bots are automated, and before they access a page of a site, they check whether a robots.txt file exists that prevents them from accessing certain pages. If you want to keep some of your site's content out of search engine indexes, robots.txt is a simple and effective tool. Here is a brief introduction to how to use it.
Where to place the robots.txt file
The robots.txt file itself is a plain text file. It must reside in the root of the domain and be named exactly "robots.txt". A robots.txt file located in a subdirectory is not valid, because bots only look for this file in the root of the domain. For example, http://www.example.com/robots.txt is a valid location, while http://www.example.com/mysite/robots.txt is not.
Here is an example of a robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~name/
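To see how a compliant bot evaluates rules like these, here is a minimal sketch using Python's standard urllib.robotparser module; the example URLs are made up purely for illustration.

from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /~name/",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved bot checks each URL against the rules before fetching it.
print(parser.can_fetch("*", "http://www.example.com/index.html"))    # True: not disallowed
print(parser.can_fetch("*", "http://www.example.com/tmp/file.html")) # False: under /tmp/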
Block or remove your entire website using a robots.txt file
To remove your site from search engines and prevent all robots from crawling it in the future, place the following robots.txt file in your server root:
User-agent: *
Disallow: /
To remove your site from Google only and prevent just Googlebot from crawling it in the future, place the following robots.txt file in your server root:
User-agent: Googlebot
Disallow: /
Each port must have its own robots.txt file. In particular, if you serve content via both http and https, each of these protocols needs its own robots.txt file. For example, to have Googlebot index all of your http pages but none of your https pages, you would use the robots.txt files below.
For the http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
Allow all robots to access your pages
User-agent: *
Disallow:
(Alternative solutions: create an empty "/robots.txt" file, or don't use a robots.txt file at all.)
Block or remove pages using a robots.txt file
You can use a robots.txt file to block Googlebot from crawling pages on your site. For example, if you are manually creating a robots.txt file to block Googlebot from crawling all pages under a particular directory (for example, private), you would use the following robots.txt entry:
User-agent: Googlebot
Disallow: /private
To block Googlebot from crawling all files of a specific file type (for example, .gif), you would use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
To block Googlebot from crawling any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, then a question mark, then any string), you would use the following entry:
User-agent: Googlebot
Disallow: /*?
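Patterns such as /*.gif$ and /*? rely on wildcard extensions that Googlebot understands: * matches any sequence of characters, and a trailing $ anchors the pattern to the end of the URL. As a rough illustration only (this is not Google's actual matching code), such a pattern can be approximated with a regular expression:

import re

def matches_rule(pattern, path):
    # Approximate Googlebot-style matching: '*' matches any characters,
    # a trailing '$' anchors the pattern to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into '.*'.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

# Hypothetical paths, for illustration only.
print(matches_rule("/*.gif$", "/images/logo.gif"))      # True
print(matches_rule("/*.gif$", "/images/logo.gif?v=2"))  # False: '$' requires the URL to end in .gif
print(matches_rule("/*?", "/search?q=robots"))          # True: the URL contains a '?'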
Although we won't crawl or index the content of pages blocked by robots.txt, we may still crawl and index their URLs if we find them on other pages on the web. As a result, the URL of the page and other publicly available information, such as anchor text in links pointing to the site, can appear in Google search results. However, no content from your pages will be crawled, indexed, or displayed.
As part of Webmaster Tools, Google provides a robots.txt analysis tool. It reads the robots.txt file in the same way Googlebot does and reports results for Google user-agents such as Googlebot. We strongly recommend using it. Before creating a robots.txt file, it is worth considering which content should be discoverable by users through search and which should not. Used sensibly, robots.txt lets search engines bring users to your website while keeping private information out of the index.
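For a quick local sanity check alongside Google's tool, the same standard-library parser used above can read your deployed file directly; note that it only does plain prefix matching and does not understand the wildcard extensions shown earlier. The host name and paths below are placeholders:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# Check a few representative URLs for a specific user-agent.
for path in ("/", "/private/report.html", "/images/logo.gif"):
    print(path, parser.can_fetch("Googlebot", "http://www.example.com" + path))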