# robotparser allows access to a website's
# robots.txt file (more on robots.txt)
import robotparser  # more on robotparser doc

# Note: this example is written for Python 2.
# In Python 3, robotparser is found at urllib.robotparser

# examples using urllib
#  - copy image (or file) off web
#  - alter user agent string
#  - browse the web with python

# the site I want to read
url = "http://pythonicprose.blogspot.com/robots.txt"
rob = robotparser.RobotFileParser()
rob.set_url(url)

# read and parse through the file
rob.read()

# if you are creating a web crawler or spider you may need to keep
# track of how long it has been since you last read the robots.txt file
# use modified() to mark the time and mtime() to read it
rob.modified()
# to get the time:
rob.mtime()

# check and see if any user agent can read the home page
print rob.can_fetch("*", "/")
# output:
#    True

# check and see if any user agent can read the search page
print rob.can_fetch("*", "/search")
# output:
#    False

# now that we are so many lines down from set_url we can check
# the host we are processing
print rob.host
# output:
#    pythonicprose.blogspot.com
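For Python 3 readers, here is a rough sketch of the same steps using urllib.robotparser. To keep it runnable without a network connection it parses a small robots.txt string instead of calling set_url() and read(). The rules in SAMPLE_ROBOTS_TXT are assumed for illustration and are not copied from the real site, and the helpers build_parser and is_stale are names made up for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not the real file).
# A live crawler would call rob.set_url(url) and rob.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from a robots.txt string."""
    rob = RobotFileParser()
    # parse() takes an iterable of lines and also records the
    # time of the parse, which mtime() returns later
    rob.parse(robots_text.splitlines())
    return rob


def is_stale(rob, max_age_seconds):
    """True if the rules were last read more than max_age_seconds ago."""
    return time.time() - rob.mtime() > max_age_seconds


rob = build_parser(SAMPLE_ROBOTS_TXT)

# check and see if any user agent can read the home page
print(rob.can_fetch("*", "/"))
# output:
#    True

# check and see if any user agent can read the search page
print(rob.can_fetch("*", "/search"))
# output:
#    False

# a crawler could re-read robots.txt once the rules are too old
print(is_stale(rob, 3600))
# output:
#    False
```

The one-hour limit passed to is_stale is an arbitrary choice; pick whatever refresh interval suits your crawler.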
A Python example-based blog that shows how to accomplish Python goals and how to correct Python errors.
Tuesday, October 6, 2009
Python - read robots.txt files with ease