Augmented Post systems (APS) are string-operating Prolog-like knowledge representation, affiliated with the “Set of Strings” Framework (SSF). APS descriptive and logical inference capabilities provide natural integration of Big Data with online analytic processing. This chapter is dedicated to strict formal definition of APS syntax, mathematical and operational semantics, and to its most valuable implementational issues, as well as to APS application to Big Data, Internet of Things, cyberphysical industry, and cybersecurity areas.
- “Set of Strings” framework
- augmented Post systems
- string data bases
- data and knowledge engineering
- Big Data
- Internet of Things
The background of “set of strings” Framework (SSF) [1, 2, 3, 4], developed for adequate modeling of Big Data, is the representation of database as an updated Set of Strings, whose structure is defined by the current metadatabase (MDB), being a set of context-free generation rules in the sense of N. Chomsky .
Augmented Post systems (APS), considered in this chapter, are result of deep integration in the one toolkit of two well-known formalisms—classical string-generating formal grammars and string-operating logical systems, proposed by E. Post .
By this, the main features of APS are:
Deep connectivity with “Set of Strings” representation of databases.
The second feature provides selection of not only data, presenting in string databases (SDB) in the explicit form, but also data, which may be constructed (derived) from the explicit data by means of logical inference.
This chapter is dedicated to the formal definition and substantial description of APS and their applications. The terminology and abbreviations are the same, as have been used when describing SSF.
Section 2 is dedicated to the definition of syntax and mathematical semantics of APS, and Section 3 to their operational semantics. Procedural connection to APS is described in Section 4, and flow-processing APS (FAPS), based on the aforementioned connection, are described in Section 5. Implementation issues and applications of APS family are considered in Section 6. The conclusion contains proposals on further development of SSF.
2. Syntax and mathematical semantics of the augmented Post systems
The augmented Post systemis the composition of set S of the so-called augmented, or string (S-), productions, and set D of context-free generating rules, being metadatabase. Following the terminology of knowledge engineering, we shall call P also as APS-represented knowledge base (for short, APS KB).
S-productionis couple, the first component of which, named body, has the form of postproduction:
where and are terms.
Term is called header, while terms are called conditions. The set of conditions is unordered, i.e., S-productions:
where are identical. The aforementioned set of conditions may be empty in general case, i.e., . The S-production with the empty conditions set looking like
is called S-axiom.
Component of S-production is set of rules and is called here variables declaration.
Rule is called variable declaration of, which defines set of possible values (domain) of this variable:
where is CF-grammar, corresponding to metadatabase . It is essential that there is one and only one variable declaration for every variable , having place in the body of S-production. S-production which body has no any variables, i.e.,
for all , is called concrete S-production (for short CS-production). CS-production:
such that is called CS-axiom.
S-production defines set of CS-productions in the following way:
where are strings in terminal alphabet , which are created by the replacement of variables by strings in the same alphabet, respectively; all occurrences of one and the same variable in all terms are replaced by one and the same string .
then S-production is correct to MDB . This means that all terms of every CS-production from set are words of context-free language , and where G is context-free grammar with set of generating rules D.
APS KB is correct, if all S-productions are correct to metadatabase .
Notion of the APS KB extensional, i.e., set of facts, which may be derived from the knowledge base by means of logical inference, is defined as follows.
Let be correct APS KB. Consider
and extensional , i.e., set of facts, defined by this knowledge base, is fixed point of the sequence , which is in general case infinite:
Evidently, due to P correctness,
and is finite, if their exists such, that
If we consider the notion of extensional of APS KB from the linguistic point of view, then is a language in alphabet which, according to (15), is sublanguage of . At the same time, from the point of view of knowledge engineering, is the join of set of facts , which are known explicitly (they are called lower ground facts), and set of facts, which are derived from ground facts and/or another derived facts. As it is easy to see, set of ground facts is nothing else, than SDB, while S-productions with nonempty sets of conditions form APS KB intensional, providing the aforementioned inference.
Now we can define semantics of language of queries to APS KB.
Set-theoretical (S-) semantics of this language is similar to S-semantics of SDB query languages:
where is APS KB extensional and is set of facts, which actuality check is the purpose of the query.
Here we shall use the simplest query language, being set of couples , where is term and is its variable declaration.
Mathematical (M-) semantics of this language is based on (14) and the following evident equation:
Here, is correct APS KB, which one-element set of S-productions—selection criterion—contains S-axiom, defining set of facts, which may belong to .
Example 1. Consider metadatabase from Example 2 of the chapter of this book, describing SSF. We shall add to it the following three rules:
Facts like contain information about sensors, which gather and send to the fusion center the data about the state of the atmosphere in the surrounding area. Facts like contain information about sensors’ location.
Consider APS knowledge base , where MDB was described higher, and contains following S-productions:
As seen, are concrete S-productions, and set , corresponding to this APS KB, consists of ground facts, being headers of these CS-productions.
Query, whose purpose is to get information about smoked areas, may be as follows:
while query, whose purpose is to get information about sensor location, may look like
Evidently, this knowledge base extensional is
3. Operational semantics of APS
The background of operational (O-) semantics of the considered query language is the so-called S-unification, which is generalization of well-known unification, introduced by J. Robinson in  and used in the resolution procedures, developed for the first-order predicate logic. (In fact, S-unification is a strict generalized definition of heuristically described unification, which, as it was shown in , is incorrect in the case of multiple occurrences of at least one variable to one of the unified terms.)
We shall consider basic concept of S-unification and then describe operational semantics of used query language, corresponding to (11–18) and defining main features of the controlled logical inference of answers to queries to APS KB.
Let be a query and is couple, formed from S-production:
As may be seen easily, the answer to may contain headers of the CS-productions, being concretizations of , i.e., belonging to set , if
This condition is equivalent to the existence of solution of word equation on context-free language (WECFL):
(for correctness, sets of variables of terms and would not intersect, that is, simply implemented by variables renaming, if this condition is not satisfied).
Result of WECFL (21) solution is the new declaration of variables of S-production , “concreted” in accordance with query . This concretization will be denoted as it is obtained by replacing variable declarations of term by new ones , which are more informative (“concrete,” “precise”), regarding initial declarations in This concretization is adequate to query in such a way, that CS-productions, whose headers may enter the answer to this query, may be created at the following steps of inference.
Logics of S-unification are illustrated in Figure 1, followed by Example 2.
Example 2. Consider query , where
and S-production from Example 10, where
According to (21),
S-unification provides opportunity of answer derivation by controlled logical inference, which is implemented by the axiomatic system, which contains only three inference rules—top-down successful S-resolution, top-down unsuccessful S-resolution, and bottom-up S-resolution [1, 2]. Here we shall describe more natural and understandable procedural representation of the considered O-semantics, which fully corresponds to the aforementioned axiomatics.
This representation contains two recursively interconnected algorithmically defined functions.
The first of them has two input variables and , which values represent query , and one output variable A, which value, returned after termination, is the answer to this query, as defined by (17).
The second function has also two input variables, and , which values represent, respectively, the so-called residual body of S-production (subset of body of S-production, including terms, which until current call were not used for the derivation of new queries) and variable declaration, obtained after previous steps of logical inference. Function returns set of variable declarations, each corresponding to one element of answer to the query , where is the next term, extracted from . By this, input variables declaration is made “more precise.”
Text of function is as follows:
As seen, lines 2–4 contain declarations of variables, used lower in the function body. It is essential that there are two variables declared as global, namely, S (set of S-productions) and D (metadatabase). Nevertheless the last one is not used in the body, it is presumed that MDB is used inside S-unification (line 6). The initial value of local variable is an empty set.
The main part of Q is loop (lines 5–11) on S-productions, entering set S and represented in the form of triples , where is header; is body of S-production, being set of terms; and is its variable declaration. The first operator inside loop (line 6) implements S-unification of query and couple . If this S-unification is successful (i.e., its result is not a special value ), then loop on elements of set of variable declarations, obtained by function RB, is executed (lines 8–10). The values of input variables of function RB are (set of terms, entering body of the current S-production) and (result of the aforementioned S-unification). Each , being set of variable declarations of the form , where , is used for making facts by substitution right parts of those declarations to term , and all such facts are joined to the output set (line 9).
Text of function RB, in turn, is as follows:
Here lines 2–4, as in the text of Q, are declarations of variables, used in the RB body. RB execution begins by check, whether set is empty (line 5): if so, search of terms of S-production body is already finished (or, as partial case, it is empty because S-production is axiom), and the returned value is one-element set , where is the same, as input value. Otherwise loop on terms, entering residual body, is executed (lines 7–11). Every selected term is used in query to KB <S,D>, which is processed by recursive call Q (s,d), and loop on all facts, entering answer to this query, is executed (lines 8–10). The body of this loop contains single operator (line 9), providing accumulation of set as a result of recursive call of the same function RB with the first input variable being (i.e., set of terms of the body of the processed S-production, reduced by extraction of the processed term ) and the second input variable value being (i.e., S-production variables declaration, made “more precise” according to result of S-unification of query and fact ).
As it is easy to see, execution of loop provides generation of sequences of terms of body of S-production, i.e., all possible permutations of these terms.
Let be finite result of function Q application to the query and APS KB <S,D>. There is theorem on the correctness of the described operational semantics of the augmented Post systems .
In fact, Q and RB collaborative operation provides top-down derivation of the set of CS-productions, involved in creation of non-ground facts, which enter the answer to the query , in full accordance with bottom-up definition of M-semantics of APS. (Ground facts are simply selected by one-step application of Q and RB.)
As seen from this description, logical inference is implemented in a wavelike manner. “Direct waves,” including generated queries, which contain “more and more concrete” variables declarations, are “spreading” until S-axioms are reached. This leads to the creation of CS-productions, and “back waves” start until a new set of facts, being answer to some intermediate (or, finally, initial) query, is assembled. Both direct and back waves meet one another and “interfere,” producing new queries. As may be seen, APS operational semantics corresponds to the universal mode of inference (all facts, being goal of the query, are derived), unlike existential mode (anyone fact from the possibly multifact set), which is inherent to known resolution procedures. By this, full accordance of APS O-semantics to its S- and M-semantics is achieved.
4. Procedural connection to APS KB
One of the most important features of any knowledge representation model are embedded tools of the procedural connection, providing interaction of “rigid” (non-updated by knowledge engineers) software modules with inference process and data exchange between inference engine and the aforementioned modules.
Background of the procedural connection to APS knowledge bases are program (P-) productions.
P-production is couple where term is header and set variable declaration; and are declarative component of the procedural connection. Symbol is the divider, distinct from , and is the name of the connected program (software module), which has its own extensional .
Couple is called cover of the connected program. In general case one and the same program may be connected to APS KB by several P-productions with different covers.
Example 3. Program named MULT for multiplication of integer numbers may be connected to knowledge base by P-production:
MULT extensional is an infinite set containing strings like , etc.
i.e., is a join of extensionals of all connected programs and set of ground facts, being heads of CS-axioms. Until general case will be discussed, we shall consider programs with the so-called static extentionals, whose basic feature is for all period of query processing.
Concerning operational semantics, it is sufficient to extend function Q by operators, providing call of the “perspective” connected programs via unified application programs interface and joining resulting sets to the accumulated answers. The extension is as follows:
As seen, after search on S-production set (lines 5–11), similar search on P-production set would be executed (lines 12–16). If S-unification of query and program cover is successful, program call is executed, and its result—set of words, denoted —is joined to the accumulated set . This fully corresponds to (22).
As seen, search on the set is nondeterministic (there is no any predefined order on this set, except S-productions processing before P-productions; but in general case, this ordering, of course, is not mandatory). This obstacle creates some difficulties in implementation of calls of the connected programs, when input data are less informative, than it is necessary for computation (e.g., . By this, result of such call may be infinite set, even if there would be an opportunity to implement this operation using tools for N-facts processing, developed in [1, 2, 3, 4].
To avoid such difficulty, it would be reasonable to implant to S- and P-productions some information, which may cut off “dangerous” branches.
There is a simple tool for logical inference control within APS knowledge representation. Namely, variables, which the declarations after S-unification would have right parts, consisting only of terminal symbols (i.e., there would be no incomplete information, associated with non-terminals, inside these declarations), are marked by point over arrow in the initial descriptions, having place in P- and S-productions. Such variables are called complete. In example 3 declarations of variables and would be , so query
while inference will result in program MULT call, while query
will not, because of information incompleteness of the declaration of , which is a complete variable. Negative result of S-unification in the case of such incompleteness is, as higher, denoted by symbol ∇.
Let us consider now the general case, where extensionals of programs, connected to APS KB, are not static, i.e., may vary while answer derivation. The simplest examples of such kind of programs are database management systems.
To connect DBMS, operating key-addressed databases, it is sufficient to join to APS KB P-production:
for queries (variable , denoting key, is complete), as well as
for update operations (both variables and , corresponding to key and data, are complete).
DBMS, operating databases with symmetric access, may be connected in a following way. In order to minimize redundant search while queries processing, some substrings of facts may be declared indexed (used for creation and maintenance of dynamic search trees, as it is described in [2, 3]). So some covers of the relational DBMS, corresponding to some subsets of DB, may contain complete variables, with declarations like , where → u defines domain of the indexed attribute of the relation. Covers, providing inclusion to SDB, contain only complete variables. (If DBMS operate SDB with incomplete information, any variable may be incomplete.)
5. Flow-processing APS
Described approach to the procedural connection makes easy integration of the distributed hardware components, linked by the computer network, into a single system with the single operation process. This may be done by procedural connection of the corresponding hardware drivers and implementation of the so-called flow-processing augmented Post systems.
FAPS KB includes along with S- and P-productions also the so-called flow (F-) productions having the form of
where terms , called activator, and , called actor, contain only complete variables and symbol “” (inverse to “”, used in S-production) divides activator and body, including, along with actor, conditions . This structure is rather usual as well as operational semantics of F-productions, which is defined by function F, similar to Q and RB, and interconnected with them into integrated algorithmics.
We assume that F-productions are represented in the knowledge base as couples , where is activator, is set of terms-conditions, and is actor, while is variable declaration. By this, APS KB contains three types of objects, corresponding to three types of productions:
, where is set of terms (S-productions);
, where is string, being name of the connected program (P-productions);
All these objects are accumulated to set S, which, along with metadatabase D, present in the bodies of functions Q, RB, and F as global variable.
Function F is as follows:
As seen, the search on F-productions set (lines 4–10) provides selection of such F-productions, which are activated by input message . The set of conditions of every activated F-production is interpreted as residual body of S-production (lines 7–9). Every variable declaration , obtained as a result of this interpretation, is used for the creation of new message by substitution of to actor . This message is used as input value for function F recursive application. So the wave of messages, triggered by initial message, is generated, modeling well-known blackboard architecture. This wave propagation of any programs (including DBMS, software/hardware drivers, providing activation and operation of various devices, as well as networking middleware) may be applied. Operator terminate (line 11) stops F execution, so no return to the parent call is performed.
Various practice-oriented dialects of APS, providing various effective technologies of non-procedural programming (multiactivating, triggers, terms with negators, decision tables, etc.), were developed; also described operational semantics of APS and its extensions were redefined for highly parallel hardware environment [1, 2]. Taking into account nondeterministic nature of operational semantics of APS family, some additional tools for logical inference control were introduced in [1, 2].
Let us consider now the main features of APS application to the most advanced areas—Big Data, Internet of Things, cyberphysical industry, and cybersecurity—forming the future digital economy information infrastructure (DEII) [13, 14]. In all aforementioned applications, APS knowledge representation plays interconnecting and flexifying role, providing fast integration of various heterogeneous systems and fast adaptation of their operation logic to the highly volatile environment. In fact, APS simplify to the maximally possible level the most complicated problem of interoperability of any a priori developed systems; APS KB is nothing but “glue,” which in the simplest regular way integrates them together.
6. Implementational issues and applications of APS
DEII is the result of integration of three basic paradigms: network-centricity, Big Data, and Internet of Things (Figure 2).
Network-centricity provides interconnection of any subjects of digital economy (DE), integrating human society and technosphere into global technosocium, operating as a global megasystem for the mankind wealth and prosperity. Big Data provides storage of great amounts of data from multiple heterogeneous sources and use of these data for rational everyday operation of the aforementioned megasystem and its permanent improvement. The Internet of Things, whose more precise and correct name would be Internet of Devices, provides creation, development, and operation of the aforementioned future global technosphere, including cyberphysical industry, based on additive manufacturing technologies, deeply robotized smart logistics, smart energy generation and delivery systems, and life resources infrastructure (food, water, living houses, etc.), as well as digital banking and finance infrastructure. All these segments, containing many billions of interconnected devices, provide implementation of such ambitious initiatives as Smart City and Smart Nation, inevitably leading to the Smart World as a well-understood and achievable goal of mankind evolution.
The most complicated problem to be solved for DEII creation and development is providing its flexibility, i.e., rapid correction of operation logic of various DEII elements and sets of elements to the constantly occurring changes of the environment, as well as to changes in our knowledge about nature, human society, and technosphere. It is evident that sufficient flexibility of DEII may be achieved only on the background of knowledge engineering, leading to the knowledge-based digital economy.
As shown in Figure 3, every subject of the DEII (no matter, human, or device), being associated with unique address of the address space of global information infrastructure, is supported by local knowledge base (LKB), applied by knowledge interpreter-corrector (KIC) for processing of input information flow, as well as local database, used and updated while aforementioned processing. Here LKB may contain not only “soft” component, i.e., rules, defining logic of input messages processing, but also “firm” component, i.e., “rigid” (nonmodified) software modules and systems (up to DBMS clients and servers), connected to LKB by the common interface, and called while rules interpretation (their sets are marked CP, e.g. Connected Programs, with lower indices).
The described approach may be effectively implemented, if there would be some unified data item, providing unified representation of operation logic of any system, joining any set of DE subjects.
The possible practical outcome of the SSF is the creation of toolkit, providing the described higher approach to DEII implementation upon string as the aforementioned unified data item. This outcome in the integrated form is placed in Figure 4, where APS family of string-operation knowledge representation models is explicated.
The operational semantics of the simplest practically used model from this family, whose background is FAPS, is shown in Figure 5.
As seen, input messages, which are transported to the local subject of DEII by means of network infrastructure, are recorded on the external blackboard (in fact, there are two blackboards used: external, for messages from the external sources, and internal, for messages generated while current external message processing). Every next message (string) is navigated by special associative index (binary/ternary tree) to the activators . Selected activators are used for solving the corresponding WECFL (to simplify variable declarations and accelerate messages parsing, the so-called ultragrammars were introduced in [1, 2]), and if solutions do exist, they are transferred to the rest part of the F-production, in which the terms may provide access to databases with symmetric access (DBSA) or key-addressed databases (KADB) SQL-like or NoSQL queries and updates; check database integrity before updates (by applying constraints defined by the corresponding sets of productions); activate devices by their drivers, connected to LKB by proper P-productions; perform some nontrivial processing by connected “firm” software products; and also send messages, which are created strings to the internal blackboard or, by lower-level communication software, to another local subject (i.e., their external blackboards).
As seen, described dialect of APS family provides flexible implementation of operations, necessary for Big Data and Internet of Things, as well as its cyberphysical manufacturing segment, i.e., industrial Internet of Things. In the simplest case, strings, being transferred to 3D printers by their driver’s calls, may be STL files containing layer descriptions of the printed material objects .
The main feature, which is, in fact, core for all APS family, is “additivity” of the local knowledge bases. Namely, the occurrence of any new device or human, generating earlier unknown and unprocessed messages, is supported by addition to LKB of subjects (again devices or humans), communicating with source of such messages, new elements with activators, providing initiation of their processing. Such occurrence results, finally, in creation and sending new messages to another subjects. Such “additivity” may be called “vertical” (new types of messages are supported by the addition of new F-productions with new activators to LKB). However, horizontal “additivity” is also supported, when new F-productions with proper activators are added for internal messages, which are sent by some modules of LKB while processing external messages. By this, internal messages processing becomes “deeper.” As seen, FAPS provide sufficiently regular and refined technology of the distributed system software development and debugging.
Let us underline that unlike well-known tools from the object programming area, which also support interconnection between modules by messages and use blackboard for such processing, in the case of flow APS, new message receivers are not known, and thus, there is no opportunity for its determination in the text of the program module. By this, APS-based knowledge engineering approach to software making principally differs from object programming as well as any other approaches from the more or less procedural programming.
Let us pay some attention to such important area of flow APS application as cybersecurity, which is critical for DEII normal operation.
Dynamically extended spectrum and complexity of cyberattacks, implementing today advanced persistent threats (APT), have led to the necessity of development of more and more sophisticated tools for early recognition and prevention of APT. The most efficient of such tools are based on the security information and event management (SIEM) paradigm [16, 17, 18, 19], whose background technologies are deep packets inspection/deep packets processing (DPI/DPP) [20, 21] and data leakage prevention (DLP) [22, 23]. The first operates flows of bit packets, providing their fusion in order to recognize elaborately covered signatures of known cyberattacks, while the second operates usually traffic from the application level of the OSI model. However, both operate strings (no matter, bit, symbol, or combined), and by this reason the described higher FAPS are the perfect tool for the SIEM implementation, especially at security operation centers (SOC), providing fusion of the primary data, passing to SOC from large amount of software and hardware traffic sensors from the security perimeter of the protected system. (Such sensors may be also effectively implemented on the FAPS background.) It is very important that SIEM is solidly based on three pillars: Big Data, which are stored fragments of the network traffic, caught at many network links in a 24 × 7 regime; data mining; and knowledge engineering. This area is the hardest known case of data mining and knowledge-based technologies application: cybersecurity knowledge engineers, by analyzing accumulated enormous volumes of primary data items, must quickly understand signatures of the prepared or already performed cyberattacks, developed by the smartest people from the global cybercrime and states special services teams; and rapidly correct LKBs or extend them by new sets of productions, providing early recognition and neutralizing such attacks. This combination of very large volumes of data and knowledge with hard real-time regime of SOC operation (e.g., SOC of the Russia largest financial group Sberbank neutralizes daily over 14,000 cyberattacks from all over the world ) makes cybersecurity application of data and knowledge engineering the most important from both practical and theoretical points of view.
The described “Set of Strings” Framework is based on integration of three research areas—knowledge/data engineering, theory of grammars, and information theory—which until now are very close but rather isolated from one another. Synergetic effect, which may be the main result of such integration, would be useful for the development of the unified theory, necessary for convergence of Big Data, data mining, artificial intelligence, and Internet of Things. Such convergence is extremely needed for solution of all spectrum of problems of digital economy.
The most valuable directions of the SSF future development may be the following:
Metainference and machine learning in APS KB.
N-facts concretization in SDB with incomplete information (SDBI) and APS KB, using functional dependencies, associated usually with the relational data model.
Contradictory SDBI and APS KB management.
SDBI and APS KB with metadatabases, describing two- and three-dimensional objects.
SSF ideology and techniques transfer to the numeric data with the help of the SSF-associated theory of recursive multisets and multiset grammars/metagrammars, developed for the solution of problems of planning and scheduling in digital economy [25, 26, 27, 28].
Hardware implementation of key elements of SSF in the data flow and other high-parallel computer environments.
The author is highly interested in the cooperation with computer scientists and engineers, who may spend some research effort participating in the development of the listed directions and the SSF at all.
The author expresses gratitude to the editor for useful advices. The author is grateful to Prof. Noam Chomsky for words of support to the development of the “Set of Strings” Framework. The author is also thankful to Prof. Jeffrey Ullman for useful remarks on the current state of the theory of grammars.