4. NUCLEAR KNOWLEDGE MANAGEMENT AND SEMANTIC TECHNOLOGIES
4.2. Semantic techniques particularly relevant to the nuclear field
4.2.1. Common vocabularies, taxonomies and thesauri
Semantic technologies are particularly effective in providing common vocabularies, taxonomies and thesauri. As many groups of different stakeholders collaborate during the life cycle of a facility, sharing and transferring information is of great importance. By formally defining the terms used in KOSs, ambiguity about the exact meaning of a term is removed for all participants across different communities.
In addition, language barriers can be removed by translating the definitions into any language.
Efforts to establish a common language within the nuclear community have been undertaken at the international, national and organizational levels. Many nuclear glossaries of technical terms, along with their formal definitions, have been produced by standards development organizations, the IAEA and many other national and international nuclear organizations. Prominent examples are the IAEA Safety Glossary and the International Nuclear Information System (INIS), operated by the IAEA, which provides a large thesaurus of nuclear terms in many languages.
Within a nuclear organization, the effort of structuring knowledge in the form of KOSs is often undertaken in several areas (e.g. management, administration, safety analysis, training, operation, maintenance, engineering, construction, design, supply), and sometimes even on several levels within each area. For instance, a single nuclear power plant (NPP) might operate tens if not hundreds of different databases and information systems, running on various platforms and of differing quality. These independently developed KOSs usually contain formal definitions of many terms in different glossaries, leading to overlaps and inconsistencies. This is even more the case at the national or international level, where many examples of different or even contradictory terms may be found in several glossaries. As these technical terms are used in licensing documents, contracts, specifications, design documents and others, problems may arise when interpreting and applying them.
Semantic technologies offer solutions to the problem of managing KOSs consistently, and support many additional features. The prerequisite for applying such technologies is the formulation of the KOS in a standardized way. The SKOS standard is particularly well suited to managing vocabularies, glossaries, taxonomies and thesauri. More involved modelling, making use of the extended capabilities of OWL, is usually not required for managing these types of KOSs; the specialized knowledge of the OWL language and its features is only needed when developing ontologies.
In practice, transferring an existing vocabulary to SKOS should not present many difficulties, as most vocabularies and thesauri already exist in some structured form in MS Word documents, PDF files or Excel sheets, which can usually be imported by SKOS editors.
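Where no direct import is available, a small script can perform the conversion. The following is a minimal sketch of such a conversion, assuming a hypothetical spreadsheet with one term and one definition per row; the file name, column layout and namespace are illustrative, not prescribed.

```python
# Minimal sketch: converting an Excel glossary to SKOS with openpyxl and rdflib.
# The file name, sheet layout (term, definition columns) and namespace are
# illustrative assumptions, not a prescribed format.
from openpyxl import load_workbook
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/glossary/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)

wb = load_workbook("glossary.xlsx")             # hypothetical source file
for term, definition in wb.active.iter_rows(min_row=2, values_only=True):
    if not term:
        continue                                # skip empty rows
    concept = URIRef(EX + term.replace(" ", "_"))
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(term, lang="en")))
    g.add((concept, SKOS.definition, Literal(definition, lang="en")))

g.serialize("glossary.ttl", format="turtle")    # ready for a SKOS editor or triple store
```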
Once a vocabulary has been translated into SKOS, the following features are available:
—Linking between two or more KOSs: The concepts of different KOSs may be linked (related to each other) in several ways offered by the SKOS standard: exactly matching concepts, closely matching concepts, narrower and broader concepts, and related concepts (a general, unspecific relationship).
—Comparing definitions and scopes between different KOSs: By linking KOSs, their definitions or other attributes, such as scope notes, may be compared. Variations in the definitions of exactly matching terms can thereby easily be detected and, if necessary, homogenized (in many cases, differing definitions will reflect the usage of the same term by different user groups and may well be valid within their scope). A sketch of such linking and comparison follows this list.
—Integrating concepts defined in other schemas: Several widely used and accepted ontologies or custom schemas have been developed for the web. These vocabularies should be reused whenever possible instead of recreating existing ones.
—Attaching linked data sources: As previously shown, the number of linked data sources is growing at a fast pace. In particular, well known sources such as DBpedia or Geonames may provide additional information to a KOS without much effort on the developer’s side. In addition, many organizations
are increasingly publishing their subject matter KOSs either publicly or on private networks. By linking to these KOSs, the original model may be greatly extended to provide a rich interwoven network able to provide related knowledge to the user (‘knowledge discovery’).
—Multilingualism and synonyms: In SKOS, a concept is identified by its URI, by which it can be retrieved and referred to. For every language considered, one preferred label is attached. In addition, alternative labels may be defined, as well as hidden labels.
—Ownership and property rights: Apart from the technical solutions, issues of ownership and property rights have to be observed when interlinking with KOSs from different sources. A reasonable approach would be one in which content owners, who retain the copyright and the control of the content, update and maintain their respective materials, while end users are able to search and retrieve the definition(s) of terms of interest, trace the source of the terms, link to relevant information and publications where the terms were used and access previous revision(s) of terms from the same originator/owner.
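As an illustration of the first two features, the following minimal sketch links two hypothetical concepts with skos:exactMatch and uses a SPARQL query to surface exactly matching concepts whose definitions differ; all URIs and definitions are invented for the example.

```python
# Minimal sketch: linking concepts from two SKOS vocabularies and comparing
# their definitions with a SPARQL query. URIs and definitions are hypothetical.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
a = URIRef("http://example.org/kosA/ContainmentBuilding")
b = URIRef("http://example.org/kosB/containment-building")
g.add((a, SKOS.definition, Literal("Structure enclosing the reactor ...", lang="en")))
g.add((b, SKOS.definition, Literal("Leak tight enclosure around the reactor ...", lang="en")))
g.add((a, SKOS.exactMatch, b))   # assert that the two concepts match exactly

# List every pair of exactly matching concepts whose definitions differ.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?x ?y ?dx ?dy WHERE {
    ?x skos:exactMatch ?y .
    ?x skos:definition ?dx .
    ?y skos:definition ?dy .
    FILTER (str(?dx) != str(?dy))
}
"""
for row in g.query(query):
    print(row.x, row.dx, "<->", row.y, row.dy)
```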
Given that KOSs can be linked to produce knowledge descriptions with increasingly larger scopes, the question arises whether, ultimately, a description of the whole nuclear domain could be realized. In theory, it does appear that such an endeavour is possible, provided that the single KOSs adhere to linked data principles, which would enable KOSs developed by different parties to be interlinked. Examples of very large structures have already been established in the medical and biological domain. As producing an in-depth, comprehensive coverage of the whole nuclear domain may require some time, even an effort on a lower scale with less detail would start providing significant benefits: a common (multilingual) vocabulary that can be used in many fields for documentation, communication and development of applications.
4.2.2. Developing knowledge organization systems
Techniques such as mind maps or concept maps are well-established approaches for mapping knowledge (organizing and linking topics and items to form a network). These approaches might be regarded as precursors to modelling knowledge by means of KOSs. KOSs developed according to standards offer significant advantages when describing large, interrelated knowledge domains, since they can be processed by machines, with almost no limitations on extensibility and linking.
Describing a complex and extensive knowledge domain, which is usually the case in the nuclear field, requires developing extensive and detailed KOSs. Good practices are available guiding the development of KOSs, usually including the steps of collecting important concepts for the subject under consideration, structuring the concepts in classes and hierarchies, defining attributes for each concept and relations between concepts, and linking concepts to other KOSs.
Still, manually developing a KOS is labour-intensive and costly. However, advances in text analysis and term extraction offer substantial help: a first seed taxonomy of limited complexity and scope might be generated by harvesting available linked data sources and then refined manually. In a next step, a body of documents related to the subject at hand can be analysed with text mining techniques, whereby terms significant in this context are extracted and offered to the developer as candidates for inclusion in the taxonomy. Such an analysis may be particularly helpful with specialized document corpora that are closely related to a particular, well defined subject, such as a given class of accidents. After some iterations, the number of new terms found by text analysis will decrease, indicating that a comprehensive description of the knowledge domain has been achieved.
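As a rough illustration of such term extraction, the sketch below ranks single words and two-word phrases of a tiny placeholder corpus by TF-IDF weight using scikit-learn; a real application would run over the full document collection and offer the top-ranked terms to the taxonomy developer.

```python
# Minimal sketch: proposing candidate taxonomy terms from a document corpus
# with TF-IDF term extraction (scikit-learn). The corpus is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "loss of coolant accident analysis for the primary circuit",
    "emergency core cooling system actuation during the accident",
    "containment pressure response after a loss of coolant event",
]

# Extract single words and two-word phrases, ignoring common English stop words.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Rank terms by summed TF-IDF weight and offer the top ones to the developer.
scores = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda t: -t[1])[:10]:
    print(f"{score:.2f}  {term}")
```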
Today’s search engines make use of methods that allow for (semi-)automatic construction and refinement of knowledge models from information on the web. For specialized knowledge domains such as the nuclear domain, such methods are not yet fully applicable. However, the massive effort being invested in methods based on artificial intelligence is likely to produce rapid progress in this area.
4.2.3. Integration of heterogeneous knowledge sources
The nuclear domain is particularly well documented, because the organizations in that domain operate in a complex, strongly regulated environment. Every step in the life cycle of a nuclear installation has to be properly documented, both to maintain records on the design, the history of the design basis and further developments within the phases of the life cycle, and to fulfil regulatory requirements.
A multitude of content management systems and other document repositories are used for this documentation. Even within a single organization, many document repositories exist, distributed among different organizational units, sometimes in different countries and different languages, resulting in a variety of non-interoperable data silos. On a national and international level, this situation is obviously much more prominent.
However, many of the documents and much of the data residing in one repository are related to information in other repositories. Nowadays, the potential of combining data, even very large amounts of data (‘big data’), to extract new information that was not available before, is increasingly recognized.
An example of the value semantic engines could provide by integrating different data sources is the analysis of minor incidents at an NPP. Correlating data from several repositories (e.g. the education and training of the personnel involved in the incident, the average age of the crew, the shift's experience with comparable events and other influencing factors) could help to better determine the root cause of such incidents and to develop appropriate corrective measures.
In view of such potential gains, organizations strive to break the barriers of non-interoperable, heterogeneous data sources. Semantic technologies play a prominent role in integrating heterogeneous sources or making them interoperable. Achieving these goals involves several steps: (i) the data structures of the repositories are analysed (software tools may support this task, e.g. for relational databases); (ii) data are extracted from the source system; (iii) the data are transformed to a format amenable to semantic tools (e.g. RDF); and (iv) the data are loaded into a target system. Software for extraction, transformation and loading (ETL software) is available to perform this task; the resulting RDF triples are stored in NoSQL databases (triple stores) or even processed on the fly. The consolidated content of the different sources can then be queried with SPARQL, allowing queries over the whole triple store containing all information sources.
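The following minimal sketch walks through steps (i) to (iv) on a hypothetical relational source: incident records are extracted from a SQLite database, transformed into RDF triples and loaded into an (here in-memory) triple store, which can then be queried with SPARQL. All table, column and property names are assumptions for the example, not a prescribed schema.

```python
# Minimal sketch of the ETL steps (i)-(iv) for a hypothetical plant database.
import sqlite3
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/plant/")

# (i)/(ii) Extract: read incident records from the relational source.
con = sqlite3.connect("plant.db")               # hypothetical source database
incidents = con.execute("SELECT id, operator, cause FROM incidents")

# (iii) Transform: map each row to RDF triples.
g = Graph()
for inc_id, operator, cause in incidents:
    inc = URIRef(EX + f"incident/{inc_id}")
    g.add((inc, RDF.type, EX.Incident))
    g.add((inc, EX.operator, URIRef(EX + f"staff/{operator}")))
    g.add((inc, EX.cause, Literal(cause)))

# (iv) Load/query: with training records transformed the same way, a single
# SPARQL query could correlate incidents with the training of the staff involved.
results = g.query("""
    PREFIX ex: <http://example.org/plant/>
    SELECT ?incident ?cause WHERE {
        ?incident a ex:Incident ; ex:cause ?cause .
    }
""")
for row in results:
    print(row.incident, row.cause)
```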
4.2.4. Automated indexing, categorization and tagging
Keywords and key phrases are sequences of one or more words that provide a compact representation of a document's content, expressing its main topics in condensed form. Extracting the most relevant words or phrases in a document is an essential task, as keywords are an important form of metadata. They also have a variety of indirect uses in tasks such as indexing, text classification and categorization, tagging, text summarization, content based retrieval, topic search, navigation and KOS creation. Their effectiveness and relevance depend on discovering a sufficient number of high quality key phrases in the text. Although manually assigning keywords would be appropriate, the effort and time required for a large collection of data and information resources render it infeasible. Hence, automatically assigned key phrases provide a highly informative semantic dimension to documents, enabling new applications.
Key phrase/keyword extraction involves expressing the main topics using the most prominent words and phrases in a document. Key phrase indexing refers to a more general task in which the source of terminology is not restricted, as it is in keyword indexing. Automated indexing is based on natural language processing (NLP) and uses text analytics and mining to extract the entities and list them in the index. It may be done with or without the use of KOS vocabularies (KOS supported extraction usually produces better results). Key phrase extraction with text analytics can generate terms from text as a source for manually building taxonomies, and also for categorizing and classifying content against the existing terms in the taxonomy. Automated categorization/classification/tagging performs text analytics or mining to extract concepts from unstructured, varied content and matches the extracted terms against a controlled vocabulary; this is also called term assignment.
Auto-categorization technologies primarily fall into two categories: machine learning based and rules based (hybrid methods make use of both). Machine learning based methods categorize automatically on the basis of previous examples, applying mathematical algorithms that learn from multiple representative samples. Such a system works best if a large collection of pre-indexed records already exists. Rules based auto-categorization works on the basis of rules created for each term in the controlled vocabulary. Rules are generally based on synonyms, supplemented by additional conditions; these systems feature more sophisticated rule writing, such as advanced Boolean expressions, regular expressions and proximity operators.
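The contrast between the two approaches can be sketched in a few lines, with placeholder training texts, labels and rules; the machine learning variant below uses a simple TF-IDF plus naive Bayes pipeline, while the rules based variant matches one regular expression per controlled vocabulary term.

```python
# Minimal sketch contrasting the two auto-categorization approaches.
# Training texts, labels and rules are illustrative placeholders.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Machine learning based: learn categories from pre-indexed examples.
train_texts = ["pump seal leakage detected", "operator retraining scheduled"]
train_labels = ["maintenance", "training"]
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["seal replacement on feedwater pump"]))  # -> ['maintenance'] here

# Rules based: one rule per controlled vocabulary term, here a regular
# expression combining a component term with a nearby condition term.
rules = {"maintenance": re.compile(r"\b(pump|valve|seal)\b.*\b(leak|repair|replace)\w*", re.I)}

def categorize(text: str) -> list[str]:
    return [term for term, pattern in rules.items() if pattern.search(text)]

print(categorize("Seal leak found on the main coolant pump"))  # -> ['maintenance']
```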
The relevance for nuclear organizations is obvious: as the number of document and data sources steadily increases, the aforementioned tasks such as classification, retrieval and navigation are increasingly dependent on metadata, which cannot possibly be assigned manually.
4.2.5. Semantic search and artificial intelligence techniques
Semantic search is based on the context, substance, intent and concept of the searched phrase. It not only matches keywords but also determines the intent and contextual meaning of the search terms. The focus is on applying the appropriate context to quickly locate the information users are seeking. Context based search helps users obtain more meaningful results by finding the most relevant documents in the repository.
Controlled vocabularies are designed to support consistent indexing and end user navigation, browsing and searching. A controlled vocabulary gathers synonyms, acronyms, variant spellings, relations and so on. Semantic search engines utilizing controlled vocabularies enable users to search for information by different names or even misspelled terms and retrieve matching concepts, not just words or phrases. Without semantics, mere text string keyword matches may miss the more relevant concepts and retrieve too much irrelevant content.
Advanced semantic search supports all available metadata fields along with terms from controlled vocabularies. It combines concept based search (restricted to controlled vocabularies) with keyword search in metadata or full text to perform a more focused search on specific categories and concepts. Handling synonymy and homonymy, and applying stemming as part of the search process, further enhances the accuracy and relevance of the search results. Features such as query expansion (extending the query by synonyms and/or relations to other terms), autocompletion of search terms (offering search suggestions while typing) or taxonomy driven facets to quickly refine and drill down into search results are typically provided.
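A core building block of such a search is query expansion from the controlled vocabulary. The sketch below expands a query with the synonyms (skos:altLabel) of the matching concept before handing the terms to a keyword search engine; the vocabulary entries are illustrative.

```python
# Minimal sketch: query expansion using synonyms (skos:altLabel) from a
# controlled vocabulary before running a plain keyword search.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
npp = URIRef("http://example.org/voc/NuclearPowerPlant")
g.add((npp, SKOS.prefLabel, Literal("nuclear power plant", lang="en")))
g.add((npp, SKOS.altLabel, Literal("NPP", lang="en")))
g.add((npp, SKOS.altLabel, Literal("nuclear power station", lang="en")))

def expand_query(query: str) -> set[str]:
    """Add all synonyms of concepts whose preferred label matches the query."""
    terms = {query}
    for concept in g.subjects(SKOS.prefLabel, Literal(query, lang="en")):
        for label in g.objects(concept, SKOS.altLabel):
            terms.add(str(label))
    return terms

print(expand_query("nuclear power plant"))
# -> {'nuclear power plant', 'NPP', 'nuclear power station'}
```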
For keyword searches without consideration of term interdependencies, information retrieval models such as the standard or extended Boolean models, vector space models, probabilistic models or inference networks are generally used. Going further, generalized and enhanced vector space models, as well as latent semantic and neural network based information retrieval models, are capable of representing terms with their interdependencies. The learning and generalization capabilities of artificial neural networks are used to build up and employ the application domain knowledge required for adaptive information retrieval. In addition, evolutionary techniques such as genetic algorithms can be used to increase information retrieval efficiency and optimize the search results.
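For illustration, the classical vector space model can be sketched in a few lines: documents and the query are represented as TF-IDF term vectors and ranked by cosine similarity, ignoring term interdependencies; the documents are placeholders.

```python
# Minimal sketch of the classical vector space model: documents and query are
# represented as TF-IDF term vectors and ranked by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "reactor coolant pump maintenance report",
    "fuel handling procedure revision",
    "coolant chemistry sampling results",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(["coolant pump"])

# Rank documents by similarity to the query; interdependencies between terms
# are ignored, which is exactly the limitation noted above.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```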
The first case study presented in the Annex discusses a realization of such systems with advanced search and retrieval features within a nuclear organization, the Indira Gandhi Centre for Atomic Research (IGCAR). It clearly shows the feasibility of such systems today and the benefits they convey.
4.2.6. Visualization
Knowledge models may quickly become large, with many concepts, attributes and relations.
Keeping an overview of KOSs during development and maintenance work can be significantly facilitated by appropriate visualizations. These visualizations can also be used to represent the results derived by querying the knowledge base through SPARQL.
In the case of a simple KOS with few relations, a hierarchical tree visualization may be sufficient.
For gaining insights into the structure of more complex ontologies, graph visualizations are usually deployed (see Fig. 8). Graph visualizations may be built into the semantic tools used for developing KOSs or attached to an exported RDF file or to a triple store. Several tools explicitly import RDF formats;
for some others, a conversion to other graph representation formats is necessary. Graph tools commonly offer several types of layout, some of them particularly well suited for displaying very large graphs, and provide customizable settings for labels, colours, node and link representations, and many other features.
Zooming and filtering are indispensable for quickly accessing the requested information.
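As a minimal sketch of such a graph visualization, the following converts an RDF file (a hypothetical export named knowledge_model.ttl) into a networkx graph, with one edge per triple, and draws it with a force-directed layout.

```python
# Minimal sketch: drawing a small RDF graph with networkx and matplotlib.
import matplotlib.pyplot as plt
import networkx as nx
from rdflib import Graph

g = Graph()
g.parse("knowledge_model.ttl", format="turtle")   # hypothetical exported RDF file

nxg = nx.DiGraph()
for s, p, o in g:
    nxg.add_edge(str(s), str(o), label=str(p))    # one edge per triple

pos = nx.spring_layout(nxg)                       # force-directed layout
# Full URIs as labels are crude; a real tool would shorten them and add filtering.
nx.draw(nxg, pos, with_labels=True, node_size=300, font_size=6)
plt.show()
```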
Visualizing the results of SPARQL queries is mostly equivalent to classical methods of visualizing tabular data (e.g. as line charts, pie charts, bar charts and many other graphical displays). In many instances, the results returned by SPARQL queries have to be transformed into the formats required by the visualization tool; however, advanced SPARQL query systems offer direct visualization, letting the user choose the particular graphic display required.
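A corresponding sketch for query results: an aggregating SPARQL query counts the concepts per concept scheme, and the result is displayed as a bar chart with matplotlib; the file name and query are illustrative.

```python
# Minimal sketch: visualizing the result of an aggregating SPARQL query as a
# bar chart. File name and scheme structure are hypothetical.
import matplotlib.pyplot as plt
from rdflib import Graph

g = Graph()
g.parse("knowledge_model.ttl", format="turtle")   # hypothetical exported RDF file

# Count the concepts in each concept scheme.
results = g.query("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?scheme (COUNT(?c) AS ?n)
    WHERE { ?c skos:inScheme ?scheme . }
    GROUP BY ?scheme
""")

schemes = [str(row.scheme) for row in results]
counts = [int(row.n) for row in results]
plt.bar(schemes, counts)
plt.ylabel("Number of concepts")
plt.show()
```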
4.2.7. Text analytics, data mining and knowledge discovery
Text analytics and data mining are defined as the processes of discovering patterns (knowledge) in data. The patterns sought are valid, understandable, implicit, non-trivial, previously unknown (novel) and potentially useful. The process is generally automatic or semi-automatic, as it is difficult to handle very large volumes of data manually with traditional techniques. Data mining involves the application of sophisticated computational intelligence algorithms and machine learning techniques, in addition to statistical analysis techniques and natural language processing, for knowledge discovery.
FIG. 8. Graph visualization of part of a knowledge model.
A knowledge discovery process includes data cleaning, data integration, data selection, transformation, data modelling, pattern evaluation and knowledge presentation. Knowledge discovery techniques perform data analysis and may uncover important data patterns, contributing greatly to knowledge bases and scientific research. This process needs to handle relational and diversified types of data to extract meaningful information. With suitable knowledge representation, real world knowledge can be used for problem solving and reasoning.
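As a small, self-contained illustration of the pattern discovery step, the sketch below clusters a placeholder document collection with k-means over TF-IDF vectors; in a real knowledge discovery process this would follow the cleaning, integration and selection steps, and the resulting groups would then be evaluated by an analyst.

```python
# Minimal sketch of a pattern discovery step: cluster documents to surface
# groups (patterns) for evaluation. The corpus is a placeholder.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feedwater pump vibration above limit",
    "vibration analysis of the main pump bearing",
    "updated training schedule for control room staff",
    "simulator training session for new operators",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, doc in zip(labels, docs):
    print(label, doc)   # documents on pump vibration vs. training separate
```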
The application of hybrid soft computing methodologies, including neural networks, fuzzy systems and evolutionary computing, provides the power to extract and express the knowledge contained in data sets in multiple ways. It improves performance in terms of accuracy and generalization capability when dealing with high dimensional, complex regression and classification problems.
Knowledge discovery tools and models help to discover and extract meaningful knowledge, in the form of human interpretable relationships/patterns buried in unstructured or structured document collections, and to represent the knowledge components in appropriate ways to facilitate operations such as storage, retrieval, inference and reasoning.
The essential aspect for nuclear organizations is the capability of these methods to extract new insights and knowledge from the huge amount of data continuously produced by all information sources. Correlating data sources that are currently treated as separate from each other can produce novel, previously unidentified information (e.g. in the analysis of events or in crisis management), as witnessed by the ever growing number of big data applications in many scientific and technical domains.