The Road Runner Project
Towards Automatic Data Extraction from Large Web Sites

Experimental Results

We have used the prototype we developed to conduct experiments on several real-life Web sites. In each site, we selected a number of classes of pages, and for each class we downloaded a number of samples (usually between 10 and 20; when the class was sufficiently small, we took all of its pages). We then used the prototype to process the classes: for each class, the prototype inferred a common wrapper and extracted all the relevant data from the samples.
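As a rough illustration of the "infer a wrapper, then extract" loop described above, the sketch below treats each page as a token sequence, keeps the tokens shared by all samples as the template, and marks the positions that differ as data fields. This is a deliberately simplified stand-in, not the inference algorithm used by the prototype: it assumes all samples of a class tokenize to the same number of tokens, and the names `tokenize`, `infer_wrapper`, `extract`, and `FIELD` are hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical marker for a data-field slot in the toy wrapper.
FIELD = None


def tokenize(page: str) -> List[str]:
    """Split a page into HTML tags and text runs, dropping blank pieces."""
    pieces = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in pieces if p.strip()]


def infer_wrapper(pages: List[str]) -> List[Optional[str]]:
    """Toy inference: tokens equal in every sample become template tokens,
    positions where the samples disagree become data fields.
    Assumption: every sample has the same number of tokens."""
    token_lists = [tokenize(p) for p in pages]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("toy sketch needs samples with equal token counts")
    wrapper: List[Optional[str]] = []
    for column in zip(*token_lists):
        wrapper.append(column[0] if len(set(column)) == 1 else FIELD)
    return wrapper


def extract(wrapper: List[Optional[str]], page: str) -> List[str]:
    """Parse a page through the toy wrapper and return its field values."""
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match the wrapper")
    values = []
    for slot, token in zip(wrapper, tokens):
        if slot is FIELD:
            values.append(token)
        elif slot != token:
            raise ValueError("template mismatch at token %r" % token)
    return values


# Made-up sample pages for one class (not taken from the experiments).
samples = [
    "<html><b>Merlot</b><i>1998</i></html>",
    "<html><b>Chianti</b><i>1997</i></html>",
]
wrapper = infer_wrapper(samples)
rows = [extract(wrapper, page) for page in samples]
```

Here `rows` holds one list of extracted values per sample page. Real pages of the same class differ in length (optional and repeated parts), which this positional comparison cannot handle; it only conveys the overall shape of the experiment.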

The table below reports some representative results. For each site, we show the sample pages of each class, the wrappers inferred by the prototype, and the data extracted by parsing the pages through the automatically generated wrapper.
The most popular e-commerce Web site
A popular e-commerce Web site
An e-commerce Web site dedicated to wines
The official Web site of the European Football (Soccer) Association
The official Web site of Major League Baseball
A popular e-commerce Web site
The Official NBA Web Site
A site hosting Linux RPM software packages
RISE is a distributed repository of online information sources that are used for the empirical analysis of learning algorithms that generate extraction patterns. The sources included in the repository are provided by people from the information extraction (IE) and wrapper generation (WG) communities. The site is maintained by Ion Muslea.

NOTE: All experiments were conducted on a machine with a 450 MHz Intel Pentium III processor and 128 MB of RAM, running Linux (kernel 2.2) and Sun Java Development Kit 1.3.

