This dataset is a collection of web pages used for research on the automatic extraction of data from the Web.
The dataset includes detail pages downloaded from 40 web sites. The detail pages cover four domains:
- soccer players
- stock quotes
- video games
- books
Pages for the video games and soccer players domains were gathered by a crawler based on a set expansion technique (see paper).
Stock quotes and books pages were collected by querying the forms of 10 finance sites and 10 bookstore sites.