The Road Runner Project
Towards Automatic Data Extraction from Large Web Sites
Experimental Results
The prototype we have developed has been used to conduct experiments on several
real-life Web sites. In each site, we selected a number of classes of pages, and
for each class we downloaded a number of samples (usually between 10 and 20;
when the class was sufficiently small, we took all of its pages). We then used
the prototype to process the classes: for each class, the prototype inferred a
common wrapper and extracted all the relevant data from the samples.
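The extraction step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a wrapper can be represented as a regular expression with named groups. The wrapper, the sample pages, and the names `WRAPPER`, `SAMPLES`, and `extract` are hypothetical and are not taken from the prototype. The wrapper here is written by hand, so the sketch does not cover the inference step.

```python
import re

# Hypothetical wrapper for one class of pages, written by hand for
# illustration. The prototype infers its wrappers automatically; this
# regular expression only stands in for such a wrapper.
WRAPPER = re.compile(
    r"<html><b>(?P<title>[^<]+)</b><i>(?P<author>[^<]+)</i></html>"
)

# Hypothetical sample pages belonging to the same class.
SAMPLES = [
    "<html><b>Database Primer</b><i>John Smith</i></html>",
    "<html><b>Web Data</b><i>Ann Lee</i></html>",
]


def extract(wrapper, pages):
    """Parse each page through the wrapper; return one record per matching page."""
    records = []
    for page in pages:
        match = wrapper.fullmatch(page)
        if match is not None:
            # Each named group becomes one field of the extracted record.
            records.append(match.groupdict())
    return records


records = extract(WRAPPER, SAMPLES)
for record in records:
    print(record)
```

Pages that do not match the wrapper are skipped in this sketch; how the prototype handles such pages is not stated here.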
The table below reports some representative results. For each site, we show
the sample pages of each class, the wrappers inferred by the prototype,
and the data extracted by parsing the pages with the automatically
generated wrapper.
NOTE: All experiments were conducted on a machine equipped with an
Intel Pentium III processor running at 450 MHz, with 128 MB of RAM,
running Linux (kernel 2.2) and Sun Java Development Kit 1.3.