The Road Runner Project
Towards Automatic Data Extraction from Large Web Sites


Experimental Results

We have used our prototype to run experiments on several real-life Web sites. For each site, we selected a number of page classes, and for each class we downloaded a number of sample pages (usually between 10 and 20; when the class was sufficiently small, we took all of its pages). We then ran the prototype on each class: it inferred a common wrapper for the class and extracted all the relevant data from the samples.
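
As a rough illustration of the workflow described above (infer one wrapper per class from sample pages, then extract data with it), the following Python sketch may help. It is not the prototype's algorithm: it assumes that all samples of a class share exactly the same tag structure and differ only in their text, and the names `tokenize`, `infer_wrapper`, `extract` and the `#PCDATA` placeholder string are hypothetical choices made for this example.

```python
import re

FIELD = "#PCDATA"  # hypothetical placeholder marking a data field in the wrapper


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def infer_wrapper(pages):
    """Generalize sample pages into one wrapper.

    Assumption of this sketch: samples differ only in text tokens,
    so optional or repeated structures are not handled.
    """
    token_lists = [tokenize(p) for p in pages]
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("samples differ in structure; not handled here")
    wrapper = []
    for column in zip(*token_lists):
        if all(tok == column[0] for tok in column):
            wrapper.append(column[0])      # same token in every sample
        elif any(tok.startswith("<") for tok in column):
            raise ValueError("tag mismatch; not handled here")
        else:
            wrapper.append(FIELD)          # text varies: treat as a data field
    return wrapper


def extract(wrapper, page):
    """Parse a page through the wrapper and return the field values."""
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match wrapper")
    values = []
    for expected, tok in zip(wrapper, tokens):
        if expected == FIELD:
            values.append(tok)
        elif expected != tok:
            raise ValueError("page does not match wrapper")
    return values


samples = [
    "<html><b>Title:</b> Moby Dick <i>Melville</i></html>",
    "<html><b>Title:</b> Emma <i>Austen</i></html>",
]
wrapper = infer_wrapper(samples)
rows = [extract(wrapper, page) for page in samples]
```

Here the constant label "Title:" stays in the wrapper as part of the template, while the two text positions that vary across the samples become fields, so each page yields one row of two values.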

The table below reports some representative results. For each site, we show the sample pages of each class, the wrappers inferred by the prototype, and the data extracted by parsing the pages through the automatically generated wrappers.

amazon.com
The most popular e-commerce Web site

buy.com
A popular e-commerce Web site

wine.com
An e-commerce Web site dedicated to wines

uefa.com
The official Web site of the European Football (Soccer) Association

majorleaguebaseball.com
The official Web site of Major League Baseball

barnesandnoble.com
A popular e-commerce Web site

nba.com
The official NBA Web site

rpmfind.net
A site hosting Linux RPM software packages

RISE
RISE (http://www.isi.edu/~muslea/RISE) is a distributed repository of online information sources that are used for the empirical analysis of learning algorithms that generate extraction patterns. The sources included in the repository are provided by people from the information extraction (IE) and wrapper generation (WG) communities. The site is maintained by Ion Muslea.
   

NOTE: All experiments were conducted on a machine equipped with an Intel Pentium III processor running at 450 MHz and 128 MB of RAM, running Linux (kernel 2.2) and the Sun Java Development Kit 1.3.


This page is maintained by Gianni Mecca and Paolo Merialdo
Road Runner: Geococcyx californianus