Python - read robots.txt files with ease
Tuesday, October 6, 2009

# robotparser gives access to a website's robots.txt file
# (more on robots.txt)
import robotparser  # more on the robotparser docs
# Note: in Python 3 robotparser is found in the urllib
# package at urllib.robotparser

# other examples using urllib:
# - copy an image (or file) off the web
# - alter the user agent string
# - browse the web with Python

# the site I want to read
url = "http://pythonicprose.blogspot.com/robots.txt"
rob = robotparser.RobotFileParser()
rob.set_url(url)

# read and parse the file
rob.read()

# if you are writing a web crawler or spider you may need to keep
# track of how long it has been since you last read the robots.txt
# file; use modified() to mark the time and mtime() to read it back
rob.modified()
# to get the time:
rob.mtime()

# check whether any user agent may read the home page
print rob.can_fetch("*", "/")
# output:
# True

# check whether any user agent may read the search page
print rob.can_fetch("*", "/search")
# output:
# False

# many lines down from set_url we can still check
# the host we are processing
print rob.host
# output:
# 'pythonicprose.blogspot.com'
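
For readers on Python 3, a minimal sketch of the same flow is below, assuming the module lives at urllib.robotparser as noted above. The RE_READ_INTERVAL constant and the allowed() helper are my own names, added only to show how modified() and mtime() can drive a periodic re-read in a crawler; they are not part of the original post.

import time
import urllib.robotparser

RE_READ_INTERVAL = 3600  # seconds; how stale we let our copy of robots.txt get (assumed value)

rob = urllib.robotparser.RobotFileParser()
rob.set_url("http://pythonicprose.blogspot.com/robots.txt")
rob.read()
rob.modified()  # record when we last fetched robots.txt

def allowed(path):
    # re-read robots.txt if our copy is older than RE_READ_INTERVAL
    if time.time() - rob.mtime() > RE_READ_INTERVAL:
        rob.read()
        rob.modified()
    return rob.can_fetch("*", path)

print(allowed("/"))        # True, if the site's rules match the output above
print(allowed("/search"))  # False, if the site's rules match the output above

Apart from the import path and the print syntax, the calls (set_url, read, can_fetch, modified, mtime) behave the same in both versions.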