Obey Liu: Using DBpedia for Relation Discovery on the Semantic Web (Results)



Using DBpedia for Relation Discovery on the Semantic Web

Lab: INRIA
Team: EXMO
Supervisor: jerome.euzenat@inrialpes.fr

Student

Obey LIU

Introduction

Exploring the Semantic Web has become more and more useful thanks to the growth in number and breadth of semantic data sources. Current approaches often involve exploring multiple distributed ontologies, but this is highly dependent on the quality of cross-ontology mappings. Other approaches instead concentrate on using a web of manually linked datasets, such as the Linked Data project.

The Linked Data project uses the Web to connect related datasets that were not previously linked, or makes it easier to link datasets that were previously linked through various incompatible methods. Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF." Among the many datasets linked by this project, the DBpedia dataset has become a de facto core, or nucleus, serving as the rendezvous point for many other datasets because of its large size, wide thematic coverage and dynamism.

Today, most ontologies and datasets cover only specific domains, are created by relatively small groups, and are very cost intensive to keep up-to-date. At the same time, Wikipedia has grown into one of the most central and versatile knowledge sources available, maintained by thousands of contributors. The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia.

Relation discovery is the problem of finding a path between two nodes of a semantic graph. This problem has applications, for example, in the semantic enrichment of word sets or in finding paths of logical inference. Because of the way Wikipedia is created, through loosely coordinated incremental editing by volunteers, and because of its rarely matched size, more than 2.6 million concepts, DBpedia creates unique challenges and opportunities for these applications. Our particular goal here is to automatically create paths between concept nodes of the DBpedia semantic graph that resemble what a human given the same task would find manually. We aspire to reproduce the common sense that a manual operator would apply.

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset, classifying various kinds of relations and their uses in relation discovery. With this knowledge, we experimented by trial and error with search algorithms to discover paths between nodes in an efficient way. To easily execute these searches, we wrote various tools in Python against a Virtuoso RDF datastore containing the latest full dump of DBpedia in all its languages.

Prerequisites

To understand this research, an understanding of basic Semantic Web concepts and some familiarity with Wikipedia are necessary.

Work carried out

Overall, this paper makes the following contributions:

  • development of high-level Python building blocks for working with DBpedia
  • analysis of the various kinds of relations in DBpedia and their usefulness on the Semantic Web
  • experimentation with various DBpedia-specific search algorithms and graph pruning heuristics

Storing and exploring the DBpedia ontologies and dataset

The DBpedia dataset is presented as a large multi-domain ontology. It currently describes 2.6 million things with 274 million facts (as of November 2008).

Wikipedia source articles consist mostly of free text, but also contain some structured information, such as infobox templates or category trees. Each thing in the DBpedia data set is identified by a URI reference of the form http://dbpedia.org/resource/Name.

[Figure: relation-discovery example involving Supermarket and Building]

For this research, we used the full dump of DBpedia 3.2 in 35 languages, totaling 38 GB of uncompressed triples in N-Triples format, split into separate files by dataset and language. Loaded into the native datastore format or into the memory cache, this takes about 60 GB.

Here is an example of a rather advanced SPARQL query and its graphically represented output:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?birth ?description ?person WHERE {
     ?person dbo:birthPlace <http://dbpedia.org/resource/Berlin> .
     ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians> .
     ?person dbo:birthDate ?birth .
     ?person foaf:name ?name .
     ?person rdfs:comment ?description .
}

This queries multiple datasets for the name, birthdate, description and URI of German musicians who were born in Berlin.

[Figure: graph of query results for German musicians born in Berlin]

Basic relation discovery in DBpedia

DBpedia is, after all, a very large directed graph. One of the most elementary tasks, or so it seems, is to find a path between two nodes (Things) on this graph.

The basic algorithmic problem here is that of uninformed, multiple-source, multiple-destination shortest-path search in a positively weighted graph. For this, we examined two algorithms:

  • a bidirectional breadth-first search with a FIFO queue, executed sequentially for each source/destination pair (a sketch follows this list)
  • a variant of bidirectional breadth-first search with a priority queue, executed simultaneously from all sources and destinations (described later here)

The first search algorithm returns the shortest path, considering the graph as unweighted. This leads to a broad range of issues, first among which is the heavy presence of certain kinds of relations of questionable usefulness:

[Figure: disambiguation links in DBpedia]

  • Disambiguation nodes are a way to resolve conflicts in article titles when a single term can be associated with more than one topic, making that term likely to be the natural title for more than one article. For example, the word Mercury can refer to several different things, including an element, a planet, an automobile brand...
  • Redirections are RDF nodes that are only used to point to another node. They serve two main purposes: alternative article naming (for example, E. coli redirects to Escherichia coli), which is useful, and articles merged into lists (for example, BlackBerry 8820 redirects to List of BlackBerry products), which creates semantic noise
  • Wikilinks are the simplest links from one Wikipedia article to another. DBpedia parses them out of Wikipedia article bodies as bare (source page, destination page) pairs, lacking any relation property.

Improved algorithms and heuristics

To help with our work, we would like to define a rough distance metric for each triple, one that correlates with the semantic strength of the measured relation. First, we need to modify our exploration algorithm to handle a weighted graph.

The second iteration of the relation explorer implements a multi-directional breadth-first search with a priority queue, an algorithm similar to Dijkstra's.

This new algorithm opens new datasets to exploration, starting with the Wikipedia Categories dataset.

[Figure: discovered path between Flower and Cucumber]

This dataset is important because it can hierarchically organize islands of RDF data. We have found that the category relations of an article often parallel its wikilinks, but, being expressed with the SKOS vocabulary, categories provide a much better replacement.

This new algorithm also lets us treat the semantic graph as a weighted graph.

As we consider each triple as a source, a relation and a destination, we can calculate a distance based on any of these elements or combinations thereof. In our implementation, we start with a base distance of 1 and add various values based on cases we match, such as penalizing wikilinks.

So far, these distance values are set in an arbitrary fashion, based on experimental observation. They already improve results for a few test cases, but we discovered that changing topics often degraded the results.

Conclusion

For this paper, we analyzed the structure of the DBpedia extracted ontology and dataset. When classifying the various kinds of relations and their uses in relation discovery, we found a wide variety of relations, some of which warranted special treatment to be made useful, or outright exclusion.

Although still far from the quality of hand-picked semantic links, the links we were able to create at the end of this research were good enough to create rudimentary ontologies from word sets by building semantic links between entities from the set using our algorithms.

To execute our various relations searches, we wrote many generic tools in Python that can be reused for other DBpedia research.

We have confirmed that the multiple and vast datasets of DBpedia can open a large range of study topics and be useful for testing multiple semantic exploration algorithms. Indeed, many areas are left to explore, using more datasets, deeper analysis, and better heuristics.

With each new version, the DBpedia dataset gains more data and becomes more richly semantic:

  • The OWL Ontology dataset has only been used as RDF in this paper. The OWL-specific properties of the DBpedia Ontology, such as collections, unions and disjunctions, could be used to improve the quality of our paths

We have only developed simple heuristics for adding weights to the DBpedia semantic graph.

  • More global matchers, such as particular sequences of patterns or PageRank-style metrics, have proven effective in other areas
  • Since our search tools are written in Python, we can leverage general Bayesian network libraries such as OpenBayes to create new weights at scale from subjective human input

