realomat/crawler.py at 417410395a3fd82452d1140c23a0eb9fda89c1a9

vdr / realomat

Vinzenz Rosenkranz on 8 Jul 2016 422 bytes add comment

import AdvancedHTMLParser

# Crawls https://www.bundestag.de/bundestag/plenum/abstimmung/2016 for vote results in XLS format.
# TODO: evaluate the results and store them in a database.

parser = AdvancedHTMLParser.AdvancedHTMLParser()

parser.parseFile("bundestag.html")
links = parser.getElementsByClassName("linkGeneric")
for link in links:
    href = link.getAttribute("href")
    # Skip elements that have no href attribute.
    if href and href.endswith(".xls"):
        print(href)
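
The links printed above may be site-relative, so a later download step would need absolute URLs. The following is a minimal sketch of the same link filtering using only the Python 3 standard library instead of AdvancedHTMLParser. The names `XlsLinkCollector`, `extract_xls_links`, and `BASE_URL` are hypothetical and not part of crawler.py, and the base URL is assumed from the comment at the top of the file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the page URL named in the crawler.py comment is the base
# against which relative hrefs should be resolved.
BASE_URL = "https://www.bundestag.de/bundestag/plenum/abstimmung/2016"


class XlsLinkCollector(HTMLParser):
    """Collects absolute URLs of .xls links that carry a given CSS class."""

    def __init__(self, base_url, css_class="linkGeneric"):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attributes without a value are reported as None, hence the "or".
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href") or ""
        if self.css_class in classes and href.endswith(".xls"):
            # Resolve relative hrefs against the page URL.
            self.urls.append(urljoin(self.base_url, href))


def extract_xls_links(html, base_url=BASE_URL):
    """Return absolute URLs of all matching .xls links in the HTML string."""
    collector = XlsLinkCollector(base_url)
    collector.feed(html)
    return collector.urls
```

With the saved page from the script above, this could be called as `extract_xls_links(open("bundestag.html", encoding="utf-8").read())`; the file encoding is also an assumption.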