3. MANAGING DISTRIBUTED KNOWLEDGE

3.4. Enriching knowledge bases with linked data

3.4.1. Usage scenarios for linked data

Since 2009 the linked data paradigm has emerged as a lightweight approach to improving data portability among systems. Based on Semantic Web standards, linked data marks a transition from hierarchies to networks as an organizational principle for data and knowledge [6]. Hence, the primary value proposition of linked data is rooted in its modular and simple network characteristics [37]. By sharing RDF as a unified data model, linked data provides the infrastructure for publishing and repurposing data on top of semantic interoperability.

Taking the network characteristics of linked data into account, it is possible to identify three prototypical usage scenarios.

Scenario 1: Internal perspective: From an internal perspective, organizations can make use of linked data principles to organize their information within their closed environments. This is especially relevant for organizations that have to deal with an increasing number of dispersed databases, federated repositories and the legacy issues deriving from them. There is high potential for linked data to consolidate these infrastructures without necessarily disrupting existing systems and workflows.

Scenario 2: Inbound perspective: In the second scenario, organizations use external data sources for purposes such as content pooling or content enrichment. This trend is backed by the increasing availability of open data (i.e. provided by governmental bodies, community projects like Wikipedia, Musicbrainz36 or Geonames37, and an increasing number of commercial data providers like Socrata,38 Factual39 or Datamarket40). Instead of creating these resources on their own, organizations can use this data either free of charge (according to the applicable terms of use) or as a paid service according to the service levels of an API (see the query sketch following the third scenario).

Scenario 3: Outbound perspective: In the third scenario, organizations apply linked data principles to publish data on the Web, either as open data or via an API that allows the fine granular retrieval of data according to a user’s needs. This process, often called linked data publishing, is basically a diversification of an organization’s data distribution strategy and allows the organization to become part of a linked data cloud. Data publishing strategies often go hand in hand with the diversification of business models and require a good understanding of the associated licensing issues. Few enterprises have yet engaged in such practices, but they are key to leveraging the full potential of linked data for purposes such as e-commerce and e-procurement.
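To make the inbound scenario concrete, the following is a minimal sketch of pulling background data from a public SPARQL endpoint. It assumes the SPARQLWrapper Python library; the DBpedia endpoint and ontology namespace are real, but the surrounding enrichment workflow is simplified for illustration.

# Inbound scenario sketch: enrich internal content with open data from a
# public SPARQL endpoint. Assumes SPARQLWrapper (pip install sparqlwrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Vienna> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

# Each binding holds the variable values of one result row; the abstract can
# then be attached to the matching internal content item.
results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:120])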

It is important to note that linked data ecosystems are not necessarily open ecosystems. Nevertheless, the degree and extent to which linked data is provided to the public is an important design issue if network effects are intended to unfold around available data. In most cases it will be reasonable to design a differentiated open access policy that makes certain parts of a linked data set available to the public under an appropriate open licence, while providing other parts of a data set under traditional property rights. Choosing the correct set of licences and developing an appropriate open access policy is not a trivial task; it is discussed in more detail in Section 3.4.3.

36 See http://classic.musicbrainz.org/

37 See http://www.geonames.org/

38 See http://www.socrata.com/

39 See http://www.factual.com/

40 See https://www.qlik.com/us/products/qlik-data-market

3.4.2. Linked data in the content value chain

As communication within electronic networks has become increasingly content-centric [38], in terms of the growing amount of unstructured information that is produced and transmitted electronically41, linked data gains importance as a lightweight approach to structuring and interlinking content. The content production process consists of five sequential steps: (i) content acquisition, (ii) content editing, (iii) content bundling, (iv) content distribution and (v) content consumption. As illustrated in Fig. 6, linked data can contribute to each step by supporting the associated intrinsic production function.

Content acquisition is mainly concerned with the collection, storage and integration of relevant information necessary to produce a content item. In the course of this process, information is pooled from internal or external sources for further processing.

Content editing entails all necessary steps that deal with the semantic adaptation, interlinking and enrichment of data. Adaptation can be understood as a process in which acquired data is provided in such a way that it can be reused within editorial processes. Interlinking and enrichment are often performed via processes like annotation and/or referencing to enrich documents either by disambiguating existing concepts or by providing background knowledge for deeper insights.
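As a brief illustration of interlinking and enrichment, the following sketch uses the rdflib Python library to disambiguate a local concept by linking it to its DBpedia counterpart; the example.org namespace is a hypothetical placeholder.

# Interlinking sketch: a local concept is disambiguated by an owl:sameAs link
# to an external linked data resource (rdflib assumed; example.org is hypothetical).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://example.org/kb/")  # hypothetical internal namespace

g = Graph()
g.bind("owl", OWL)

vienna = EX["Vienna"]
g.add((vienna, RDFS.label, Literal("Vienna", lang="en")))
# The local concept denotes the same entity as the DBpedia resource, making
# external background knowledge reachable by link traversal.
g.add((vienna, OWL.sameAs, URIRef("http://dbpedia.org/resource/Vienna")))

print(g.serialize(format="turtle"))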

Content bundling is mainly concerned with the contextualization and personalization of information products. It can be used to provide customized access to information and services (i.e. by using metadata for the device sensitive delivery of content, or to compile thematically relevant material into landing pages or dossiers), thus improving the navigability, findability and reuse of information.

Content distribution mainly deals with the provision of machine readable and semantically interoperable (meta-)data via APIs or SPARQL endpoints. These can be designed either to serve internal purposes so that data can be reused within controlled environments (i.e. within or between organizational units) or for external purposes so that data can be shared between anonymous users (i.e. as open SPARQL endpoints on the Web).
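As a minimal sketch of this step (assuming rdflib, with illustrative example.org data), the same SPARQL interface that an internal or public endpoint would expose can be evaluated against a local graph:

# Distribution sketch: SPARQL over an in-memory graph, standing in for the
# query interface an API or endpoint would expose. Data is illustrative.
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix dct: <http://purl.org/dc/terms/> .
    <http://example.org/doc/1> dct:title "Annual report" .
""", format="turtle")

# The same query could be sent unchanged to a SPARQL endpoint on the Web.
query = """
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?doc ?title WHERE { ?doc dct:title ?title . }
"""
for row in g.query(query):
    print(row.doc, row.title)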

Content consumption is the last step in the content value chain. This entails any means that enable a human user to search for and interact with content items in a user-friendly and purposeful way. This step thus mainly deals with end user applications that make use of linked data to provide access to content items (i.e. via search or recommendation engines) and generate deeper insights (i.e. by providing reasonable visualizations).

41 Reference [38] reports that, over the period from 2011 to 2016, electronically transmitted video content grew by 90%, gaming content by 76%, and VoIP and file sharing traffic by 36% each.


FIG. 6. Linked data in the content value chain (Ref. [39]).


Reference [40] proposes a model that describes various stakeholder roles in the creation of linked data assets, as illustrated in Fig. 7.

The model distinguishes between the various stakeholder roles an economic actor can take in the creation of linked data assets and the various types of data and applications that are created along the data transformation process. In the course of value creation, raw data, which is provided in some non-RDF format (e.g. XML, CSV, PDF, HTML), is transformed into linked data.

According to the model, raw data is consumed by a linked data provider and transformed into RDF, thereby gaining compliance with linked data principles. This step is crucial in enhancing the semantic interoperability of the data. The transformation process can vary significantly in technological complexity, ranging from simple mapping procedures (i.e. with tools like Google Refine) to heavyweight extensible stylesheet language transformations (XSLTs). This complexity depends on that of the underlying schema and sometimes requires highly skilled professionals to manage the task properly. In a further step, the linked data is consumed and processed by a linked data application that also provides the interface to the human user. These applications usually do not differ in look and feel from existing ones but use linked data for functional extensions such as reasoning and filtering. Finally, the end user consumes the human readable data via functionally extended applications and services.
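To make the transformation step concrete, the following is a minimal sketch of lifting raw CSV into RDF with the rdflib Python library. The file name, column layout and the choice of the FOAF vocabulary are assumptions for illustration; real transformations are typically far more involved, as noted above.

# Provider-step sketch: raw CSV is lifted into RDF (rdflib assumed; file name,
# columns and vocabulary are illustrative assumptions).
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/person/")  # hypothetical URI namespace

g = Graph()
g.bind("foaf", FOAF)

with open("people.csv", newline="") as f:  # assumed layout: "id,name"
    for row in csv.DictReader(f):
        person = EX[row["id"]]  # mint a URI for each record
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(row["name"])))

g.serialize(destination="people.ttl", format="turtle")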

This view is extended in Ref. [41] with an orthogonal layer called ‘support services and consultation’, stressing that, apart from the value creation process itself, linked data also creates an environment for added value services that transcend the pure transformation and consumption of data.

FIG. 7. Linked value chain (Ref. [40]).

Such services are usually provided by data brokers, who collect, clean, visualize and resell available data for further processing and consumption.42

As illustrated in Fig. 7, the process of linked data creation can be covered in its entirety by one actor or might require several economic actors, depending on the technological complexity of the data transformation process. Existing use cases (e.g. from the BBC or New York Times) reveal that linked data transformations are usually outsourced rather than handled in house. For the time being, it is also difficult to estimate the cost effectiveness of linked data applications, but model based analysis (see Refs [26], [28]) indicates that the savings can be significant, depending on the scale and scope of a linked data project. This is due to the network effect linked data generates as an integration layer across various components and workflows in complex IT systems. Here, linked data can help to reduce technological redundancies (thus lowering maintenance costs), improve information access by reducing search and discovery efforts, and provide opportunities for business diversification owing to the higher granularity and increased connectivity of content.

3.4.3. Licensing linked data

New technology has never been a sufficient precondition for the transformation of business practices but has always been accompanied by complementary modes of cultural change [44]. Although linked data fits neatly into incremental IT development practices, it brings disruptive technological effects that pose significant challenges and change the nature and notion of data as an economic asset [42].

Hence, it is crucial to develop an appropriate licensing strategy for linked data that takes account of the various asset specificities of linked data as intellectual property.43

3.4.3.1. Intellectual property of linked data

Semantic metadata is a fairly new kind of intellectual asset that is still subject to debate about adequate protection instruments [47]. In the European Union the legal framework of property rights related to linked data comprises copyright,44 database right,45 competition law46 and patent law.47 These appropriative legal regimes are complemented by open access policies and licensing instruments. Table 3 illustrates how linked data assets and intellectual property rights relate to each other.

Copyright protects the creative and original nature of a literary work and gives its holder the exclusive legal right to reproduce, publish, sell or distribute the matter and form of the work. Hence, any literary work that can claim a sufficient degree of originality can be protected by copyright.

Database right protects a collection of independent works, data or other materials that are arranged in a systematic or methodical way and are individually accessible by electronic or other means. It requires a substantial investment in obtaining, verifying or presenting the contents of the database.

Databases can also be protected as literary works under copyright, provided they show a sufficient degree of originality.

An Unfair Practices Act protects rights holders against certain trade practices that are considered unfair in terms of misappropriation, advertising, sales pricing or damages to reputation.

42 See also Archer, et al. (Ref. [42], p. 13), who carried out a study on business models for linked open government data. A discussion of the data broker industry is provided by the US Committee on Commerce, Science and Transportation [43].

43 A detailed discussion of the licensing issues associated with linked data can be found in Refs [45] or [46].

44 See Directive 2001/29/EC. See also http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32001L0029

45 See Directive 96/9/EC. See also http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31996L0009:E N:HTML

46 See consolidated versions of the Treaty on European Union and the Treaty on the Functioning of the European Union – Official Journal C 326, 26.10.2012. pp. 0001–0390. See also http://eur-lex.europa.eu/legal-content/EN/TXT/

HTML/?uri=CELEX:12012E/TXT&from=EN

47 See http://www.epo.org/law-practice/legal-texts/html/epc/1973/e/ar52.html

Misappropriation is especially relevant to semantic metadata, occurring when data is reused without appropriate compensation (i.e. in terms of attribution or financial return).

Patenting does not directly impact the protection of semantic metadata as, at least in Europe, patents can be acquired solely for hardware related inventions. But as soon as semantic metadata becomes indispensable, is the subject of a methodology that generates physical effects, has a sufficient level of inventiveness and can be exploited commercially, it can be protected under patent law.

3.4.3.2. Commons-based approaches

The open and non-proprietary nature of Semantic Web design principles allows for the easy sharing and reuse of linked data for collaborative purposes. This offers organizations new opportunities to diversify their assets and nurture new forms of value creation (i.e. by extending the production environment to open or closed collaborative settings) or unlock new revenue channels (i.e. by establishing highly customizable data syndication services on top of fine granular accounting services based on SPARQL).

Thus, traditional intellectual property rights regimes should be complemented by open access policies and corresponding licensing instruments.48 Creative Commons49 allows licensing policies to be defined for the reuse of work protected by copyright. Open Data Commons50 does the same for assets protected by database right, and open source licences complement the patent regime as an alternative form of resource allocation and value generation in the production of software and services [44]. Creative Commons and Open Data Commons have gained popularity over the last few years, allowing maximum reusability while providing a framework for protection against unfair usage practices and rights infringements. Nevertheless, to meet the requirements of the various linked data asset types, a linked data licensing strategy should make a deliberate distinction between the database and the content stored in it. This is necessary because content and databases are distinct subjects of protection in intellectual property law and therefore require different treatment and protection instruments. Additionally, open licences should be provided in machine readable form, using rights expression languages such as CCREL51 or ODRL52, so that machines can filter datasets according to their terms of use.
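As a sketch of that last point, the following attaches a machine readable licence statement to a dataset description so that agents can filter on it. It assumes the rdflib Python library; the dataset URI is a hypothetical placeholder, while the licence URI and the cc/dct vocabularies are real.

# Licensing sketch: a machine readable licence statement on a dataset
# description (rdflib assumed; the dataset URI is hypothetical).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

CC = Namespace("http://creativecommons.org/ns#")  # ccREL vocabulary
dataset = URIRef("http://example.org/dataset/kb")  # hypothetical dataset URI

g = Graph()
g.bind("cc", CC)
g.bind("dct", DCTERMS)

# The database description carries its own licence statement, kept distinct
# from any licence on the content stored in it.
g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, CC.attributionName, Literal("Example Organization")))

print(g.serialize(format="turtle"))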

48 A detailed discussion of licensing issues related to linked open data is provided in Ref. [46].

49 See http://creativecommons.org/

50 See http://opendatacommons.org/

51 See http://www.w3.org/Submission/ccREL/

52 See https://www.w3.org/community/odrl/

TABLE 3. LINKED DATA ASSETS AND RELATED PROPERTY RIGHTS (REF. [39])

                Copyright    DB right    Comp. law    Patents

Instance data   NO           YES         PARTLY       NO
Metadata        NO           YES         YES          NO
Ontology        YES          YES         YES          NO
Content         YES          NO          YES          NO
Service         YES          NO          YES          PARTLY
Technology      YES          NO          YES          PARTLY

3.4.4. Providing a community norm

In addition to the licensing information expressed by Creative Commons and Open Data Commons licences, a so-called community norm is the second component of a linked data licensing policy. Such a norm is basically a human readable recommendation of how the data should be used, managed and structured as intended by the data provider. It should provide administrative information (e.g. creator, publisher, licence, rights), structural information about the dataset (e.g. version number, quantity of attributes, types of relations) and recommendations for interlinking (e.g. preferred vocabulary to secure semantic consistency).
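Parts of such a norm can also be published in machine readable form alongside the human readable text. The following is a hypothetical sketch using the rdflib Python library with the VoID and Dublin Core vocabularies; all example.org URIs are placeholders.

# Community norm sketch: administrative and interlinking information expressed
# with VoID and Dublin Core (rdflib assumed; example.org URIs are hypothetical).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, RDFS

VOID = Namespace("http://rdfs.org/ns/void#")
ds = URIRef("http://example.org/dataset/kb")

g = Graph()
g.bind("void", VOID)
g.bind("dct", DCTERMS)

g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.creator, Literal("Example Organization")))
g.add((ds, DCTERMS.hasVersion, Literal("1.0")))
g.add((ds, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((ds, VOID.vocabulary, URIRef("http://xmlns.com/foaf/0.1/")))  # preferred vocabulary
g.add((ds, RDFS.comment, Literal(
    "Please attribute Example Organization when reusing this data.", lang="en")))

print(g.serialize(format="turtle"))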

Community norms can differ widely in depth and complexity. Below we present three real world examples that illustrate what community norms look like:

Example 1: A community norm as part of an RDF-S statement by the University of Southampton.

rdfs:comment "This data is freely available to use and reuse. Please provide an attribution to University of Southampton, if convenient. If you're using this data, we'd love to hear about it at [email protected]. For discussion on our RDF, join http://mailman.ecs.soton.ac.uk/mailman/listinfo/ecsrdf, for announcements of changes, join http://mailman.ecs.soton.ac.uk/mailman/listinfo/ecsrdf-announce."^^xsd:string;

Example 2: A community norm provided by http://datahub.io/dataset/uniprot

Copyright 2007–2012 UniProt Consortium. We have chosen to apply the Creative Commons Attribution-NoDerivs License (http://creativecommons.org/licenses/by-nd/3.0/) to all copyrightable parts (http://sciencecommons.org/) of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first. All databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

Example 3: A community norm provided by the International Press Telecommunications Council (IPTC) for embedding their metadata into media files.53

Embedded Metadata Manifesto

How metadata should be embedded and preserved in digital media files

Photographers, film makers, videographers, illustrators, publishers, advertisers, designers, art directors, picture editors, librarians and curators all share the same problem: struggling to track rapidly expanding collections of digital media assets such as photos and video/film clips.

With that in mind we propose five guiding principles as our "Embedded Metadata Manifesto":

1. Metadata is essential to describe, identify and track digital media and should be applied to all media items which are exchanged as files or by other means such as data streams.

2. Media file formats should provide the means to embed metadata in ways that can be read and handled by different software systems.

3. Metadata fields, their semantics (including labels on the user interface) and values should not be changed across metadata formats.

4. Copyright management information metadata must never be removed from the files.

5. Other metadata should only be removed from files by agreement with their copyright holders.

More details about these principles:

1: All people handling digital media need to recognise the crucial role of metadata for business. This involves more than just sticking labels on a media item. The knowledge which is required to describe the content comprehensively and concisely and the clear assertion of the intellectual ownership increase the value of the asset. Adding metadata to media items is an imperative for each and every professional workflow.

2: Exchanging media items is still done to a large extent by transmitting files containing the media content, and in many cases this is the only (technical) way of communicating between the supplier and the consumer. To support the exchange of metadata with content, it is a business requirement that file formats embed metadata within the digital file. Other methods like sidecar files are potentially exposed to metadata loss.

3: The type of content information carried in a metadata field, and the values assigned, should not depend on the technology used to embed metadata into a file. If multiple technologies are available for embedding the same field the software vendors must guarantee that the values are synchronized across the technologies without causing a loss of data or ambiguity.

4: Ownership metadata is the only way to save digital content from being considered orphaned work. Removal of such metadata impacts on the ability to assert ownership rights and is therefore forbidden by law in many countries.

5: Properly selected and applied metadata fields add value to media assets. For most collections of digital media content descriptive metadata is essential for retrieval and for understanding. Removing this valuable information devalues the asset.

53 See http://www.embeddedmetadata.org/embedded-metatdata-manifesto.php

4. NUCLEAR KNOWLEDGE MANAGEMENT AND