Friday 10 April, 14:00-15:00, Room N02
Stefano Ortona (University of Oxford)
Title: "WADaR: Joint Wrapper and Data Repair in Web Data Extraction"
Abstract: Web scraping is a popular way of acquiring data from the web. Recent progress has enabled scalable wrapper generation, but it has not yet been complemented by scalable tools for maintaining the wrappers or for analysing the extracted data. We present WADaR, a scalable and highly automated joint wrapper and data repair tool. WADaR uses off-the-shelf entity recognisers to locate entities of interest in wrapper-generated instances. Markov chains are then used to determine likely repairs for the data, which are then encoded as regex-based repairs for both the data and the corresponding wrappers. WADaR is capable of repairing not only the wrapper-extracted data but also the wrapper that generated it, in such a way that future extractions will no longer need to be repaired. WADaR is able to increase the quality of wrapper-generated instances by between 15% and 60%, and to fully repair the corresponding wrapper without any knowledge of the original website in more than 50% of the cases in our experiments. We evaluated WADaR on four different wrapper generation systems and achieved significant improvements in the extracted data in all cases.
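
To make the pipeline in the abstract more concrete, the following is a minimal toy sketch of the three ideas it names: annotate extracted values with entity recognisers, use a first-order Markov chain over the annotations to pick the most likely attribute sequence, and apply regex-based repairs. It is not WADaR's actual implementation. The recognisers (`RECOGNISERS`), the function names, and the sample property-listing values are all hypothetical, and simple regexes stand in for real off-the-shelf entity recognisers.

```python
import re
from collections import Counter

# Hypothetical stand-ins for off-the-shelf entity recognisers:
# one regex per attribute of interest.
RECOGNISERS = {
    "price": re.compile(r"\$[\d,]+"),
    "bedrooms": re.compile(r"\d+ bed(?:room)?s?"),
}


def annotate(value):
    """Return the attribute labels recognised in `value`, ordered by position."""
    spans = []
    for label, pattern in RECOGNISERS.items():
        for match in pattern.finditer(value):
            spans.append((match.start(), label))
    return [label for _, label in sorted(spans)]


def most_likely_sequence(values):
    """Greedily follow the most frequent transitions of a first-order
    Markov chain built from the label sequences of all values."""
    transitions = Counter()
    for value in values:
        labels = ["START"] + annotate(value) + ["END"]
        transitions.update(zip(labels, labels[1:]))

    sequence, state = [], "START"
    while True:
        candidates = {nxt: n for (cur, nxt), n in transitions.items() if cur == state}
        if not candidates:
            break
        state = max(candidates, key=candidates.get)
        if state == "END" or state in sequence:  # stop at the end or on a cycle
            break
        sequence.append(state)
    return sequence


def repair(value, sequence):
    """Regex-based repair: split one mis-segmented value into one field
    per attribute in `sequence` (None when the attribute is absent)."""
    repaired = {}
    for label in sequence:
        match = RECOGNISERS[label].search(value)
        repaired[label] = match.group(0) if match else None
    return repaired


# Hypothetical output of a broken wrapper that merged two attributes into one field.
SAMPLE_VALUES = ["$250,000 3 bedrooms", "$99,500 2 beds", "$410,000"]
SEQUENCE = most_likely_sequence(SAMPLE_VALUES)
REPAIRED = [repair(value, SEQUENCE) for value in SAMPLE_VALUES]
```

In this sketch the repair is applied only to the extracted data. Repairing the wrapper itself, as the abstract describes, would mean attaching the same regex-based expressions to the wrapper so that later extractions come out correctly segmented.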