A cluster-based RDFa spider
The RDFa Web Spider (RDFaS) was my third-year dissertation project.
Abstract (from the final report)
Current web 'standards' formalize the formatting and provision of information on the Web, but little of this information can be put into context by a machine without heavy analysis. A proposed XHTML extension called RDFa allows the content creator to specify the type of data on a web page, which implies or specifies the context and relationships of this data. This potentially allows automated processes to discern the meaning of the information. There are many search engines for several different types of media, but most commonly they allow the user to search content on the Web and return results based on a relevance match, which is often determined by the frequency with which the search term appears in the document. The aim of this project is to index pages which contain RDFa data for searching, tackling the issues involved with, and providing more research into, crawling and indexing large numbers of pages and enormous amounts of data.
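To give a rough idea of what "specifying the type of data" looks like in practice, here is a minimal sketch of pulling RDFa annotations out of a page so they could be indexed. It is not the RDFaS code: the class and function names, the sample markup and the simplified handling of `about`, `property` and `content` are all assumptions made for illustration, and a real RDFa processor does far more (CURIE resolution, nested subjects, `rel`/`rev`, datatypes).

```python
from html.parser import HTMLParser


class RDFaPropertyExtractor(HTMLParser):
    """Collects simplified (subject, property, value) tuples from RDFa attributes.

    Hypothetical helper for illustration only; it ignores most of the RDFa
    processing rules and tracks just the most recent `about` value.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = ""     # most recent `about` value seen
        self._pending = None   # property still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self._subject = attrs["about"]
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit `content` attribute overrides the element text.
            self.triples.append((self._subject, prop, attrs["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.triples.append((self._subject, self._pending, text))
            self._pending = None


def extract_rdfa(html):
    """Return the list of tuples found in an (X)HTML string."""
    parser = RDFaPropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


# Made-up sample markup using Dublin Core property names.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/example/page">
  <h1 property="dc:title">Example page</h1>
  <span property="dc:creator" content="Jane Doe">written by Jane</span>
</div>
"""

if __name__ == "__main__":
    for subject, prop, value in extract_rdfa(SAMPLE):
        print(subject, prop, value)
```

An indexer could then store these tuples against the page URL, so a search can match on the stated property (for example, pages whose `dc:creator` is a given name) rather than on term frequency alone.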
For more information
The Survey & Analysis report, Final report and presentation poster are all available here.