Friday, April 13, 2007

(Web)Harvesting the web

I was recently assigned a task whereby I had to obtain data from the website of one of our commercial services, but the website is completely Web 1.0 and has no APIs of any kind that we could hook into. To give you a bit of context, the site performs transactions on our behalf and we need information about those transactions. As my boss saw it, there were really only two solutions available:
1) Have somebody sit at a computer and download the transactions every 15 minutes
2) Have a computer set up to run a screen macro to log in and download the transactions file every 15 minutes (better, but still nowhere near ideal)

I think that, without even realizing it, my boss gave me the idea for option three. He kept mentioning the idea of screen-scraping the page to retrieve the information. Traditionally that means taking a visual representation of something and extracting information from what is essentially a picture. After doing some reading, I interpreted his suggestion to mean that I should find a way to do a web-scrape on the page (a subtle difference), and after a bit more research I found the perfect library for doing it in Java.

It's a project called WebHarvest, and it's fairly simple yet really powerful. You start off by writing a scraper configuration in XML, load it in code, run the scraper, and it will store your data in the Scraper for you to retrieve when you need it. The library itself works by doing either POSTs or GETs to a page, taking the response data, almost certainly running a transform to convert the HTML to well-formed XHTML, and then running XPath queries and regular expressions on the result to pull out the data you need (e.g. the rows of a table). It's incredibly powerful, and if you need to automate logging into a website and retrieving data from it, this is a great way to do it.
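To make that concrete, here's a rough sketch of the Java side, assuming the Web-Harvest Java API as I understand it. The config file name, working directory and variable name below are placeholders, and the exact call for reading a variable back out of the context may differ between versions:

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class TransactionScrape {
    public static void main(String[] args) throws Exception {
        // Load the XML scraper configuration ("transactions.xml" is a made-up name).
        // Inside it you chain elements like <http>, <html-to-xml>, <xpath> and
        // <var-def> to fetch the page, clean it up, and pull out the rows you want.
        ScraperConfiguration config = new ScraperConfiguration("transactions.xml");

        // The second argument is a working directory the scraper can write to.
        Scraper scraper = new Scraper(config, "/tmp/webharvest");

        // Runs the configuration: the GETs/POSTs, the XHTML conversion and the
        // XPath/regex extraction all happen here, driven by the XML.
        scraper.execute();

        // Anything defined with <var-def name="transactions"> in the config is
        // left in the scraper's context afterwards, ready to be read back out
        // (retrieval call shown as in the examples I've seen; may vary by version).
        Variable transactions = (Variable) scraper.getContext().getVar("transactions");
        System.out.println(transactions.toString());
    }
}

From there it's just a matter of scheduling something like that class to run every 15 minutes instead of a person.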
