# Abstract

The Semantic Web Dog Food (SWDF) is the reference linked dataset of the Semantic Web community about papers, people, organisations, and events related to its academic conferences. In this paper we analyse the existing problems of generating, representing and maintaining Linked Data for the SWDF. With this work (i) we provide a refactored and cleaned SWDF dataset; (ii) we use a novel data model which improves the Semantic Web Conference Ontology, adopting best ontology design practices and (iii) we provide an open source workflow to support a healthy growth of the dataset beyond the Semantic Web conferences.

Permanent URL: https://w3id.org/scholarlydata

Resource type: Ontology and dataset.

# Introduction

A good practise in the Semantic Web community is to encourage the publication of Linked Data about scientific conferences in the field, as a way of "eating our own dog food" . The main example is the Semantic Web Dog Food (SWDF), a corpus that collects Linked Data about papers, people, organisations, and events related to academic conferences. Currently, all main Semantic Web conferences and related events publish their data as Linked Data on SWDF, but for many other conferences, events and publication venues information is still not available in a structured and linked form. On the other hand the growth of available content with respect to the early times of SWDF poses data management issues and reveals design problems which where not foreseen when the dataset was at its initial stage. There are several challenges to pursue the maintenance of a healthy and sustainable SWDF for the future: (i) the availability of appropriate vocabularies to express the current state of the data; (ii) the shared knowledge of such vocabularies; (iii) the availability of tools to ease the task of data acquisition, conversion, integration, augmentation, verification and finally publication; (iv) the ongoing maintenance of the dataset.

In this work we address these issues and we propose a refactoring of the Semantic Web Conference (SWC) Ontology. The new ontology, named conference-ontology , adopts best ontology design practices (e.g. Ontology Design Patterns, ontology reuse and interlinking) and guarantees interoperability with SWC ontology and all other pertinent vocabularies. We use cLODg (conference Linked Open Data generator) to regenerate the SWDF dataset according to conference-ontology and provide a sustainable solution for the growth of the dataset in the future.

The main advantage of the proposed approach is the availability of a shared procedure and open source tools for conference data generation, with the primary goal to ensure the sustainability and usability of our own Semantic Web Dog Food and the ease of data contribution from beyond our community. We make the new resource available at https://w3id.org/scholarlydata as data dump, SPARQL endpoint and we offer the facilities to generate data about new conferences using cLODg and submit it for addition to scholarlydata. The newly submitted data is manually checked before inclusion to avoid corruption of the dataset and general spam.

# State of the art

The first considerable effort to offer comprehensive semantic descriptions of conference events is represented by the metadata projects at ESWC 2006 and ISWC 2006 conferences , with the Semantic Web Conference Ontology being the vocabulary of choice to represent such data. Increasing number of initiatives are pursuing the publication about conferences data as Linked Data, mainly promoted by publishers such as Springer or Elsevier amongst many others. For example, the knowledge management of scholarly products is an emerging research area in the Semantic Web field known as Semantic Publishing . Semantic Publishing aims at providing access to semantic enhanced scholarly products with the aim of enabling a variety of semantically oriented tasks, such as knowledge discovery, knowledge exploration and data integration. The Semantic Publishing challenge is a breakthrough in this direction. Its objective is assessing the quality of systems that extract meaningful metadata from scholarly articles and represent them as RDF. Similarly, the Jailbreaking the PDF initiative is aimed at creating a formal flexible infrastructure to extract semantic information from PDF documents as domain-specific annotations.

# The SWDF and its current issues

The SWDF uses the Semantic Web Conference (SWC) ontology as the reference ontology for modelling data about academic conferences. The SWC ontology combines existing widely accepted vocabularies (i.e. FOAF, SIOC and Dublin Core) and relies on the SWRC (Semantic Web for Research Communities) ontology for modelling entities such as accepted papers, authors, their affiliations, talks and other events, the organising committee and all other roles involved. The core types of SWC ontology are foaf:Person for describing people, foaf:Organization for organisations (e.g. universities, research institutions, etc.), swc:Artefact for documents (e.g. papers, proceedings, etc.), swc:OrganisedEvent for events and swc:Role for the people roles at the conference. Unfortunately, the lack of clear guidelines for data generation and maintenance and some modelling choices of the SWC ontology affect the current quality of SWDF. The data generation is based on a collaborative model that delegates the metadata chairs of each conference to independently deal with the process of generating conference Linked Data. Linked Data are generated from a variety of formats typically provided by a conference management system (e.g. EasyChair). While the collaborative process is beneficial to the growth of the dataset and its adoption in the community, the lack of clear guidelines and of standard tools supporting the generation process affects the quality of generated data. Examples are: (i) a portion of the included conference/workshop data use vocabularies or ontologies which are not aligned to the SWC ontology and, in some cases, no longer maintained or existing (e.g. swrc-ext or xmllondon; (ii) the usage of classes and properties not defined in the SWC ontology and introduced without providing an extension of the ontology (e.g. swc:room, swc:editorList, swc:completeGraph, swc:IW3C2Liaison, swc:SemanticWebTechnologiesCo-ordinator, etc.); (iii) the misuse of properties (either defined in the SWC ontology or in other vocabularies/ontologies) with respect to their domain and range; (iv) typos (e.g. the materialisation of triples having the predicate swc:partOf instead of swc:isPartOf). In addition, we argue that the SWC ontology itself has intentional issues, mainly concerning the modelling of affiliations, roles and lists. Affiliations (of people to organisation) are represented via the object property swrc:affiliation from the SWRC ontology while the membership relation (organisation to people) via the property foaf:member. Although intuitive, this representation ignores the temporal dimension (i.e. the time when a given affiliation is held by an actor) that is relevant to interpret affiliations correctly. For example, with this model it is not possible to provide a correct answer to a simple competency question, such as "What was the affiliation of a person when participating to a certain conference?". Roles such as program chair, track chair, etc. are currently modelled using an ontology pattern based on the reification of a n-ary relation. The n-ary relation is identified by individuals of the class swc:Role which are used to associate people to events. The SWC ontology contains a very basic set of role classes (i.e. swc:Chair , swc:Delegate , swc:Presenter and swc:ProgrammeCommitteeMember ) represented as sub-classes of swc:Role. This choice allows to instantiate the small set of different Role classes and cover the roles at specific events. For example, instead of sub-classing the swc:Chair class with MainChair, WorkshopChair, TutorialChair, etc., the different types of chairs should simply be instances of the generic swc:Chair and labelled appropriately (e.g. iswc2015:general-chair). The problem with this solution is that the (individuals representing) roles are defined locally to each conference, e.g. there is a different individual for representing the role "general chair" for each conference in the dataset. This causes the presence of 1,717 distinct individuals in the current dataset that truly represent a set of only 34 unique roles (cf. ). Hence, it is difficult to answer simple queries like "Who was the general chair at each edition of ISWC?" without using regular expressions on roles' labels (such labels are heterogeneous and not always provided). Finally, lists of authors are represented via the property bibo:authorList, which accepts rdf:List or rdf:Seq as range. Therefore lists of authors in the SWDF are expressed via the properties rdf:_1 , rdf:_2 , rdf:_3 , etc., based on rdfs:ContainerMembershipProperty. This solution makes querying and reasoning on ordered list of authors very hard .

# A sustainable SWDF

Our solution for enhancing the SWDF and solving the issues described in is based on (i) the refactoring of the SWC ontology, (ii) the refactoring of the current SWDF dataset and (iii) a fully implemented open source workflow to generate, verify and add data to SWDF. The proposed refactoring of the SWC ontology, conference-ontology, is a new self-contained ontology, which exploits Ontology Design Patterns (ODP) . We model affiliations reusing the time indexed situation ODP and the roles held by people at a conference with the time indexed person role ODP. Both patterns provide commonly accepted solutions to model complex situations as n-ary relations, amongst many other available ones .

The classes conf:AffiliationDuringEvent and conf:AffiliationAtTimeOfSubmission model situations where a person (an individual of the class conf:Person) is affiliated to an organisation (an individual of the class conf:Organisation) at a specific time, which can be either an interval (coinciding with the conference dates) or the instant when the paper was submitted. This allows the representation of cases where a person changes affiliation in the time interval between paper submission and conference event. Similarly, the class conf:RoleDuringEvent associates a person with a role (an individual of the class conf:Role) at a conference. Additionally, conf:AffiliationDuringEvent and conf:AffiliationAtTimeOfSubmission can be associated with conf:AffiliationRole, a subclass of conf:Role, to represent the role held by a person within an organisation. We reused the Sequence ODP to represent ordered lists of authors. We represent a list with conf:List, whose items are individuals of the class conf:ListItem. The association between conf:ListItem and conf:List is done via the property conf:isItemOf. A conf:List has pointers to the first (conf:hasFirstItem) and the last item (conf:hasLastItem). Each conf:ListItem is linked to its predecessor (conf:hasPreviousItem) and successor (conf:hasNextItem). This new modelling overcomes the limitation in the current SWDF offering a new range of services for scholarly monitoring, such as statistics on career development, change of affiliations over time, covered roles at conferences in order to monitor their involvement and impact at different granularity levels, ranging from a broader scientific area to specific communities or conferences. An example of query to obtain all roles covered overtime by a specific researcher is the following:

				PREFIX person: <http://w3id.org/scholarlydata/person/>
PREFIX conf: <http://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
SELECT ?role ?during
WHERE{  person:valentina-presutti conf:holdsRole ?roleAt .
?roleAt conf:withRole ?role .
?roleAt conf:during ?during}


To guarantee interoperability with SWC and all other already used vocabularies in the SWDF dataset, we produced extensive alignments, which allow the materialisation of triples via reasoning. We include alignments to:

• the SWC ontology itself to guarantee backward interoperability with SWDF;
• the top level classes of Dolce D0 for interoperability with a series of linked datasets aligned to it (e.g. DBpedia);
• all relevant SPAR ontologies such as: FaBIO for compliance with FRBR; DoCO for modelling the part relations (conf:hasPart and its inverse conf:isPartOf existing between abstracts conf:Abstract, articles conf:InProceedings and the books of proceedings conf:Proceedings); PRO and SCORO for the modelling of roles as defined in SPAR;
• the Organization Ontology for modelling organisations, roles and affiliations;
• FOAF for modelling people;
• SKOS for modelling broader/narrower relations;
• ICATZD for events;
• the Collection Ontology for modelling the sequences represented by the lists of authors.

# Scholarlydata.org

Using cLODg and our new conference-ontology we performed a batch cleaning of the whole SWDF dataset, consisting of 48 conferences and 235 workshops. The new dataset contains 93,519 individuals. The distribution of classes is reported in .

For the role definitions we corrected the current 1,717 roles in SWDF, defined at conference level, by generating 34 roles at global level and reusing them at conference level. E.g. the role role:general-chair is one individual which can be reused in all conferences with the relation conf:withRole. These 34 roles are organised in a hierarchy by using SKOS to express broader and narrower relations between them, e.g. the role role:chair is defined as skos:broader role:general-chair. The current list of roles can be obtained using the query:

					PREFIX conf: <http://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
SELECT distinct ?role
WHERE{  ?person conf:holdsRole ?roleAt .
?roleAt conf:withRole ?role }


Using cLODg to produce metadata about a new conference guarantees that pertinent roles are reused if already existing the dataset.

We produced instance level alignments of (i) individuals of conf:Person to ORCID (Open Researcher and Contributor ID) and (ii) individuals of conf:InProceedings to DOI (Digital Object Identifier), whenever possible. ORCID provides persistent digital identifiers for scientific researchers and academic authors. A DOI is a serial code used to uniquely identify digital objects, particularly used for electronic documents. The alignments to ORCID were produced by using the public API provided by ORCID. The references to DOI were produced by using the API provided by Crossref, performing a search on each article title.

All data is uploaded on https://w3id.org/scholarlydata where can be accessed in different formats (i.e. HTML+RDFa, RDF/XML, Turtle, N-TRIPLES, and JSON-LD) via URI dereferencing, queried via SPARQL or downloaded as single RDF dumps for each conference and workshop. Each dump is provided in two versions: a simple one, where data is represented by the conference-ontology only and one containing all the alignments (and therefore also complaint to SWDF), which have been materialised using a reasoner. These dumps are released with the "creative commons by 3.0" license and are described by using the VOID vocabulary. Additionally, we explicitly state the primary source of our data is the SWDF by using the property prov:hadPrimarySource of PROV-O. Dump data is also publicly available on datahub. It is worth remarking that cLODg is released as an open source software with the MIT License and can be used by metadata curator to add data about a new conference. In fact, cLODg provides a nearly one-click process to produce conference Linked Data and includes all the components for data transformation, deduplication, URIs reuse, alignment of individuals to external resources, etc. and assures that data is produced according to the conference-ontology and compliant with the SWDF. An early description of cLODg can be found in and in the github repository for its newer version. By providing a user friendly data generation tool we aim at encouraging the growth of the dataset beyond the Semantic Web community.

# Conclusions and future work

This paper analyses the Semantic Web Dog Food dataset and discusses its quality and sustainability issues. As the main scholarly dataset for the Semantic Web community, we believe it is important that the dataset is maintained in good health. We therefore perform a refactoring on the dataset addressing its current issues and we make the cLODg workflow publicly available as potential solution for future maintenance. The new resource https://w3id.org/scholarlydata is publicly available both as dump download and SPARQL endpoint, with facilities to upload new data. With the availability of cLODg as standard Linked Data publication workflow, we believe that scholarlydata has the potential to grow way beyond the Semantic Web conferences. As future work we plan a systematic evaluation of the resource and the introduction of more sophisticated components to deal with instance matching in the cLODg workflow. Moreover we will work on fostering collaboration with Conference Management System providers, to provide cLODg as a build-in facility in the systems.

# References

1. W. Harrison. Eating your own dog food. Industrial and Organizational Psychology, (June):5–7, 2011.

2. A. G. Nuzzolese, A. L. Gentile, V. Presutti, and A. Gangemi. Semantic web conference ontology - a refactoring solution. In The Semantic Web: ESWC 2016 Satellite Events, page to appear. Springer, 2016.

3. A. L. Gentile, M. Acosta, L. Costabello, A. G. Nuzzolese, V. Presutti, and D. R. Recupero. Conference Live: Accessible and Sociable Conference Semantic Data. In Proc. of WWW 2015 (Companion Volume), pages 1007–1012. ACM, 2015.

4. A. L. Gentile and A. G. Nuzzolese. cLODg - Conference Linked Open Data Gen- erator. In Proc. of the ISWC 2015 Posters & Demonstrations Track. CEUR-WS.org, 2015.

5. K. Möller, T. Heath, S. Handschuh, and J. Domingue. Recipes for semantic web dog food: The eswc and iswc metadata projects. In Proc. of ISWC’07/ASWC’07, pages 802–815, Berlin, Heidelberg, 2007. Springer.

6. K. Möller, S. Bechofer, and T. Heath. Semantic web conference ontology. retrievable on line at http://data.semanticweb.org/ns/swc/swc_2009-05-09.html, 2009.

7. D. Shotton. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2):85–94, 2009.

8. C. Lange and A. Di Iorio. Semantic publishing challenge - assessing the quality of scientific output. In SemEval Challenge, volume 475 of CCIS, pages 61–76. Springer, 2014.

9. A. Garcia, P. Murray-Rust, G. Burns, R. Stevens, D. Tkaczyk, C. McLaughlin, A. Belin, A. Di Iorio, L. García, C. Gruson-Daniel, et al. Pdfjailbreak–a com- munal architecture for making biomedical pdfs semantic. In Proceedings of BioLINK SIG 2013, page 13, 2013.

10. P. Ciccarese and S. Peroni. The collections ontology: Creating and handling collections in owl 2 dl frameworks. Semantic Web, 5(6):515–529, 2014.

11. A. Gangemi and V. Presutti. Ontology Design Patterns. In S. Staab and R. Studer, editors, Handbook on Ontologies, 2nd Edition. Springer Verlag, 2009.

12. A. Gangemi and V. Presutti. A multi-dimensional comparison of ontology design patterns for representing n-ary relations. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 86–105. Springer, 2013.

13. A. Gangemi and V. Presutti. A multi-dimensional comparison of ontology design patterns for representing n-ary relations. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 86–105. Springer, 2013.

cLODG is an Open Source tool that provides a formalised process for the conference metadata publication workflow https://github.com/anuzzolese/cLODg2.

http://data.elsevier.com/documentation/index.html
http://xmlns.com/foaf/spec/
http://rdfs.org/sioc/spec/
http://dublincore.org/documents/dcmi-terms/
http://www.cs.vu.nl/~mcaklein/onto/swrc\_ext/2005/0
http://xmllondon.com/ns/swc/ontology
The prefix iswc2015: stands for the namespace http://data.semanticweb.org/conference/iswc/2015/.
The prefix role: stands for the namespace http://w3id.org/scholarlydata/role/.
http://orcid.org/
http://www.doi.org/index.html
http://members.orcid.org/api/introduction-orcid-public-api
http://www.crossref.org/guestquery/
http://vocab.deri.ie/void
https://www.w3.org/TR/prov-o
https://datahub.io/dataset/scholarlydata