Tuesday, October 6, 2009

Python - read robots.txt files with ease

# the robotparser module gives you programmatic access to a website's 
# robots.txt file 
 
import robotparser
# see the robotparser documentation for full details 
# Note: in Python 3 the robotparser module lives in the urllib 
# package, at urllib.robotparser 
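# A minimal sketch of the Python 3 equivalent (commented out so this 
# Python 2 listing still runs as-is; the class name is unchanged): 
# 
#   from urllib import robotparser 
#   rp = robotparser.RobotFileParser() 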
 
# related examples using urllib: 
#   - copy an image (or file) off the web 
#   - alter the user agent string 
#   - browse the web with Python 
 
 
# the robots.txt file for the site I want to check 
url = "http://pythonicprose.blogspot.com/robots.txt" 
 
rob = robotparser.RobotFileParser()
rob.set_url(url)
 
# read and parse the robots.txt file 
rob.read()
 
# if you are writing a web crawler or spider you may need to keep 
# track of how long it has been since you last read the robots.txt file; 
# modified() marks the current time as the last-read time 
rob.modified()
 
# mtime() returns that timestamp: 
rob.mtime()
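 
# as a sketch of that bookkeeping, a crawler could re-read robots.txt 
# once it is older than some refresh interval (the one-hour figure 
# below is just an assumed example, not something the module requires) 
import time 
MAX_AGE = 60 * 60  # assumed refresh interval in seconds 
if time.time() - rob.mtime() > MAX_AGE: 
    rob.read()       # fetch and parse the file again 
    rob.modified()   # record the new last-read time 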
 
# check and see if any user agent can read the home page 
print rob.can_fetch("*", "/")
# output: 
#   True 
 
# check and see if any user agent can read the search page 
print rob.can_fetch("*", "/search")
# output: 
#   False 
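 
# a small sketch of how a crawler might use can_fetch to filter a 
# list of candidate paths before downloading anything (these paths 
# are made-up examples) 
candidate_paths = ["/", "/search", "/2009/10/"] 
allowed_paths = [p for p in candidate_paths if rob.can_fetch("*", p)] 
print allowed_paths 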
 
# if you have lost track of which site the parser was set to, 
# the host attribute tells you 
print rob.host
# output: 
#   'pythonicprose.blogspot.com' 
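 
# putting the pieces together, a helper along these lines (the name 
# and the default agent string are just illustrative) answers the same 
# question for any site's robots.txt 
def is_allowed(robots_url, path, agent="*"): 
    parser = robotparser.RobotFileParser() 
    parser.set_url(robots_url) 
    parser.read() 
    return parser.can_fetch(agent, path) 
 
print is_allowed("http://pythonicprose.blogspot.com/robots.txt", "/search") 
# output: 
#   False 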
 
 
 
