# FAS_scraper.py
# v.1.0 (March 1, 2010)
# Mel Chua
# This is a quick proof-of-concept scraper inspired by Diana Martin's research
# on the Fedora community; she's trying to get a gauge on who in Fedora
# is an "active contributor," so I suggested making a tiny scraper to gather
# all the FAS-authenticated activity of a user from existing webpages.
# I'm pretty sure most of these services have APIs that would do the job
# better and less kludgily, but this is just to see if it's a useful thing.

== Caveat ==

This isn't actually a proper README.txt - rather, a quick hack taken from the
opening code comments. The Python code itself is extensively commented (there
are 11 lines of actual code in the 46-line file).

== Installation ==

You will need Python and twill installed to run this script. On Fedora:

    yum install python python-twill

Then download FAS_scraper.py into a directory and run it:

    python FAS_scraper.py

You'll see a lot of output (the HTML of the pages being scraped) dumped into
your terminal; I'm leaving it verbose for now on purpose so people can see
what's going on.

You'll end up with a series of .html files in the directory that
FAS_scraper.py is in. These contain the raw HTML dumps of the profile pages
for that FAS user for each specified service.

== Sample output ==

http://mchua.fedorapeople.org/FAS_scraper/sample_output

== Further developments ==

Some quick suggestions for further work - what actually needs to happen is for
this to be re-architected into a good general-purpose Python library for
getting data from FAS-authenticated services.

* Instead of manually defining the list of FAS usernames in the code, grab the
  list of usernames from the actual FAS system.
* Check for validity of the FAS users you're looking for - right now, if you
  enter a username that doesn't exist, the program will try to download the
  pages for that user anyway. (It won't stop the program; you'll just get
  output for that user consisting of webpages saying that the user doesn't
  exist.)
* Add more services.
* Check for validity of services.
* Create a class for services so that we can handle cases that aren't
  reachable by the format /. (For instance, what if it's //?)
* Create a class for users that can parse and spit out statistics for each of
  the services you're looking at. For instance, can you automatically get the
  value of username.pkgdb.number_maintained()?
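
As a starting point for the service class suggested in the list above, here is a minimal sketch (Python 3, standard library only). Everything in it is hypothetical and for illustration: the class name `Service`, the `{username}` placeholder convention, the `example.org` URLs, the `planned_downloads` helper, and the output filename pattern are assumptions, not part of FAS_scraper.py. It only builds URLs and filenames; the actual fetching (with twill or anything else) is left out.

```python
from urllib.parse import quote


class Service:
    """A service whose per-user page URL follows a template (hypothetical design)."""

    def __init__(self, name, url_template):
        # Basic validity check for the service definition: the template
        # must say where the username goes.
        if "{username}" not in url_template:
            raise ValueError(
                "url_template for %r needs a {username} placeholder" % name
            )
        self.name = name
        self.url_template = url_template

    def profile_url(self, username):
        # Percent-encode the username so it is safe inside a URL path.
        return self.url_template.format(username=quote(username, safe=""))

    def output_filename(self, username):
        # Assumed naming scheme; the real script may name its dumps differently.
        return "%s_%s.html" % (username, self.name)


def planned_downloads(usernames, services):
    """Return (url, filename) pairs for every user/service combination."""
    return [
        (service.profile_url(user), service.output_filename(user))
        for user in usernames
        for service in services
    ]


if __name__ == "__main__":
    # Placeholder URLs, not real Fedora endpoints.
    services = [
        Service("wiki", "https://wiki.example.org/User:{username}"),
        Service("pkgdb", "https://pkgdb.example.org/users/{username}/packages"),
    ]
    for url, filename in planned_downloads(["mchua"], services):
        print(url, "->", filename)
```

Because the username can sit anywhere in the template, this covers services where the profile page is not simply the base URL followed by the username. A user class that parses the downloaded pages into statistics would be a separate step on top of this.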