In representation learning, knowledge graph embedding (KGE), also called knowledge representation learning (KRL), or multi-relation learning,[1] is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning.[1][2][3] Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.[1][4]
A knowledge graph $\mathcal{G} = \{E, R, F\}$ is a collection of entities $E$, relations $R$, and facts $F$.[5] A fact is a triple $(h, r, t) \in F$ that denotes a link $r \in R$ between the head $h \in E$ and the tail $t \in E$ of the triple. Another notation often used in the literature to represent a triple (or fact) is $\langle head, relation, tail \rangle$, following the resource description framework (RDF).[1][5] A knowledge graph represents the knowledge related to a specific domain; leveraging this structured representation, it is possible to infer new knowledge from it after some refinement steps.[6] However, real-world applications must still cope with the sparsity of the data and with the computational cost of using it.[3][7]
The embedding of a knowledge graph is a function that translates each entity and each relation into a vector of a given dimension $d$, called the embedding dimension.[7] It is even possible to embed the entities and relations with different dimensions.[7] The embedding vectors can then be used for other tasks.
A knowledge graph embedding is characterized by four aspects: the representation space, the scoring function, the encoding model, and any additional information used.[1]
All algorithms for creating a knowledge graph embedding follow the same general approach.[7] First, the embedding vectors are initialized to random values.[7] Then, they are iteratively optimized using a training set of triples. In each iteration, a batch of $b$ triples is sampled from the training set, and each triple in it is corrupted—i.e., turned into a triple that does not represent a true fact in the knowledge graph.[7] A triple is corrupted by substituting its head or its tail (or both) with another entity that makes the fact false.[7] The original triple and the corrupted triple are added to the training batch, and then the embeddings are updated by optimizing a scoring function.[5][7] Iteration stops when a stop condition is reached.[7] Usually, the stop condition depends on when the model starts to overfit the training set.[7] At the end, the learned embeddings should have captured the semantic meaning of the training triples and should correctly predict unseen true facts in the knowledge graph.[5]
The following is the pseudocode for the general embedding procedure.[9][7]
algorithm Compute entity and relation embeddings is
    input: the training set $S = \{(h, r, t)\}$, entity set $E$, relation set $R$, embedding dimension $k$
    output: entity and relation embeddings
    initialization: the entity embeddings $e$ and relation embeddings $r$ are randomly initialized
    while stop condition do
        $S_{batch} \leftarrow sample(S, b)$    // sample a batch of size $b$ from the training set
        for each $(h, r, t)$ in $S_{batch}$ do
            $(h', r, t') \leftarrow sample(S')$    // sample a corrupted fact
            $T_{batch} \leftarrow T_{batch} \cup \{((h, r, t), (h', r, t'))\}$
        end for
        update embeddings by minimizing the loss function
    end while
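A minimal Python sketch of this loop, using numpy, a TransE-style translational score, and plain stochastic gradient descent on a margin loss; the function name, hyperparameters, and toy facts are illustrative, not a reference implementation.

import numpy as np

def train_embeddings(triples, num_entities, num_relations,
                     dim=50, batch_size=128, margin=1.0, lr=0.01, epochs=100):
    """Learn entity and relation embeddings with a TransE-style margin loss."""
    rng = np.random.default_rng(0)
    ent = rng.normal(scale=0.1, size=(num_entities, dim))   # entity embeddings
    rel = rng.normal(scale=0.1, size=(num_relations, dim))  # relation embeddings
    triples = np.asarray(triples)

    for _ in range(epochs):
        # Sample a batch of true facts from the training set.
        batch = triples[rng.choice(len(triples), size=batch_size)]
        for h, r, t in batch:
            # Corrupt the fact: replace the head or the tail with a random entity.
            h_c, t_c = h, t
            if rng.random() < 0.5:
                h_c = rng.integers(num_entities)
            else:
                t_c = rng.integers(num_entities)

            pos = ent[h] + rel[r] - ent[t]       # residual of the true fact
            neg = ent[h_c] + rel[r] - ent[t_c]   # residual of the corrupted fact
            # Margin loss: true facts should score better than corrupted ones.
            if margin + np.linalg.norm(pos) - np.linalg.norm(neg) > 0:
                g_pos = pos / (np.linalg.norm(pos) + 1e-9)
                g_neg = neg / (np.linalg.norm(neg) + 1e-9)
                ent[h] -= lr * g_pos
                ent[t] += lr * g_pos
                rel[r] -= lr * (g_pos - g_neg)
                ent[h_c] += lr * g_neg
                ent[t_c] -= lr * g_neg
    return ent, rel

# Toy usage: 4 entities, 2 relations, three hand-made facts.
facts = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]
ent, rel = train_embeddings(facts, num_entities=4, num_relations=2,
                            dim=8, batch_size=3, epochs=200)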
These indexes are often used to measure the embedding quality of a model. The simplicity of the indexes makes them very suitable for evaluating the performance of an embedding algorithm even on a large scale.[10] Given $Q$ as the set of all ranked predictions of a model, it is possible to define three different performance indexes: Hits@K, MR, and MRR.[10]
Hits@K (or, in short, H@K) is a performance index that measures the probability of finding the correct prediction among the top K model predictions.[10] Usually, $k = 10$ is used.[10] Hits@K reflects the accuracy of an embedding model in correctly predicting the relation between two given entities.[10]
$\text{Hits@K} = \frac{|\{q \in Q : q < k\}|}{|Q|} \in [0, 1]$
Larger values mean better predictive performance.[10]
Mean rank is the average ranking position of the items predicted by the model among all the possible items.[10]
$MR = \frac{1}{|Q|}\sum_{q \in Q} q$
The smaller the value, the better the model.[10]
Mean reciprocal rank measures how highly the correct triples are ranked on average.[10] If the first prediction for a triple is correct, 1 is added; if the second is correct, $\tfrac{1}{2}$ is added; and so on.[10]
Mean reciprocal rank is commonly used to evaluate search and ranking algorithms.[10]
$MRR = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{q} \in [0, 1]$
The larger the index, the better the model.[10]
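For concreteness, a short Python sketch computing the three indexes from a list of ranks (one rank per test query); the example rank values are made up.

def hits_at_k(ranks, k=10):
    """Fraction of queries whose correct answer appears within the top k predictions."""
    return sum(1 for q in ranks if q <= k) / len(ranks)

def mean_rank(ranks):
    """Average rank of the correct answer (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank (higher is better)."""
    return sum(1.0 / q for q in ranks) / len(ranks)

# Example: ranks of the correct entity for five test triples (illustrative values).
ranks = [1, 3, 12, 2, 7]
print(hits_at_k(ranks, k=10))        # 0.8  -> 4 of 5 correct answers in the top 10
print(mean_rank(ranks))              # 5.0
print(mean_reciprocal_rank(ranks))   # about 0.41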
Knowledge graph completion (KGC) is a collection of techniques to infer knowledge from an embedded knowledge graph representation.[11] In particular, these techniques complete a triple by inferring the missing entity or relation.[11] The corresponding sub-tasks are named link or entity prediction (i.e., predicting the missing entity of a triple given the other entity and the relation), and relation prediction (i.e., predicting the most plausible relation that connects two entities).[11]
Triple classification is a binary classification problem.[1] Given a triple, the trained model uses the embeddings to evaluate its plausibility and determine whether it is true or false.[11] The decision is made with the model's score function and a given threshold.[11] Clustering is another application that leverages the embedded representation of a sparse knowledge graph to place semantically similar entities close together in a 2D space.[4]
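A minimal sketch of these two uses of a trained model, assuming a generic score_fn(h, r, t) that returns a plausibility score and an already chosen threshold; both names are placeholders, not part of any specific library.

import numpy as np

def predict_tail(score_fn, h, r, num_entities, k=10):
    """Link prediction: rank every candidate tail entity for the query (h, r, ?)."""
    scores = np.array([score_fn(h, r, t) for t in range(num_entities)])
    return np.argsort(-scores)[:k]  # indices of the k most plausible tails

def classify_triple(score_fn, h, r, t, threshold):
    """Triple classification: the fact is predicted true if its score exceeds the threshold."""
    return score_fn(h, r, t) >= threshold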
The use of knowledge graph embedding is increasingly pervasive in many applications. In recommender systems, knowledge graph embedding can overcome the limitations of the usual reinforcement learning-based methods,[12][13] as well as the limitations of the conventional collaborative filtering method.[14] Training this kind of recommender system requires a huge amount of information from the users; knowledge graph techniques can address this issue by using a graph already built from prior knowledge of item correlations and using the embedding to infer recommendations from it.[12] Drug repurposing is the use of an already approved drug for a therapeutic purpose different from the one for which it was initially designed.[15] The task of link prediction can be used to infer a new connection between an existing drug and a disease, using a biomedical knowledge graph built from the massive amount of available literature and biomedical databases.[15] Knowledge graph embedding can also be used in the domain of social politics.[4]
Given a collection of triples (or facts) $\mathcal{F} = \{\langle head, relation, tail \rangle\}$, the knowledge graph embedding model produces, for each entity and relation present in the knowledge graph, a continuous vector representation.[7] $(h, r, t)$ is the corresponding embedding of a triple, with $h, t \in \mathbb{R}^d$ and $r \in \mathbb{R}^k$, where $d$ is the embedding dimension of the entities and $k$ that of the relations.[7] The score function of a given model is denoted by $f_r(h, t)$ and measures the distance of the embedding of the head from the embedding of the tail given the embedding of the relation; in other words, it quantifies the plausibility of the embedded representation of a given fact.[5]
Rossi et al. propose a taxonomy of the embedding models and identify three main families of models: tensor decomposition models, geometric models, and deep learning models.[5]
Tensor decomposition is a family of knowledge graph embedding models that represent the knowledge graph as a multi-dimensional matrix (a tensor),[1][5][18] which is only partially known because the graph never describes its domain exhaustively.[5] In particular, these models use a third-order (3D) tensor, which is then factorized into low-dimensional vectors that are the embeddings.[5][18] A third-order tensor is suitable for representing a knowledge graph because it records only the existence or absence of a relation between entities,[18] and there is no need to know the network structure a priori.[16] This makes this class of embedding models light and easy to train, even if they suffer from the high dimensionality and sparsity of the data.[5][18]
This family of models uses a linear equation to embed the connection between the entities through a relation.[1] In particular, the embedded representation of the relations is a bidimensional matrix.[5] During the embedding procedure, these models only use the single facts to compute the embedded representation and ignore the other associations involving the same entity or relation.[19]
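A minimal sketch of a bilinear score in this spirit (RESCAL-style, with the relation represented as a $d \times d$ matrix); the variable names and dimensions are illustrative.

import numpy as np

def bilinear_score(h_vec, rel_matrix, t_vec):
    """Bilinear score f_r(h, t) = h^T M_r t, with M_r the d x d matrix of relation r."""
    return h_vec @ rel_matrix @ t_vec

# Illustrative embeddings with d = 4.
rng = np.random.default_rng(0)
h_vec, t_vec = rng.normal(size=4), rng.normal(size=4)
rel_matrix = rng.normal(size=(4, 4))
print(bilinear_score(h_vec, rel_matrix, t_vec))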
The geometric space defined by this family of models encodes the relation as a geometric transformation between the head and tail of a fact.[5] For this reason, to compute the embedding of the tail, it is necessary to apply a transformation $\tau$ to the head embedding, and a distance function $\delta$ is used to measure the goodness of the embedding or to score the reliability of a fact.[5]
$f_r(h, t) = \delta(\tau(h, r), t)$
Geometric models are similar to tensor decomposition models, but the main difference between the two is that geometric models have to preserve the applicability of the transformation $\tau$ in the geometric space in which it is defined.[5]
This class of models is inspired by the idea of translation invariance introduced in word2vec.[7] A pure translational model relies on the fact that the embedding vectors of the entities are close to each other after applying a proper relational translation in the geometric space in which they are defined.[19] In other words, given a fact, the embedding of the head plus the embedding of the relation should equal the embedding of the tail.[5] The closeness of the entity embeddings is given by some distance measure and quantifies the reliability of a fact.[18]
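A minimal sketch of a translational score of this kind (in the style of TransE); larger scores mean more plausible facts, and the vectors below are illustrative.

import numpy as np

def translational_score(h_vec, r_vec, t_vec, norm_order=1):
    """The closer h + r is to t, the more plausible the fact; return a negative
    distance so that larger values mean more plausible."""
    return -np.linalg.norm(h_vec + r_vec - t_vec, ord=norm_order)

# Illustrative 2D embeddings.
h_vec, r_vec, t_vec = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(translational_score(h_vec, r_vec, t_vec))  # 0.0: h + r lands exactly on t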
It is possible to associate additional information with each element in the knowledge graph and with the facts they form.[1] Each entity and relation can be enriched with text descriptions, weights, constraints, and other features in order to improve the overall description of the domain with a knowledge graph.[1] During the embedding of the knowledge graph, this information can be used to learn specialized embeddings for these characteristics together with the usual embedded representation of entities and relations, at the cost of learning a larger number of vectors.[5]
This family of models employs a rotation-like transformation, either in addition to or instead of a translation.[5]
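A minimal sketch of a rotation-based score (in the style of RotatE, which rotates complex-valued entity embeddings element-wise); the phases and vectors are illustrative.

import numpy as np

def rotational_score(h_vec, r_phase, t_vec):
    """Each relation is an element-wise rotation e^{i * phase} applied to the head;
    the rotated head should land close to the tail for a plausible fact."""
    rotation = np.exp(1j * r_phase)  # unit-modulus complex rotation
    return -np.linalg.norm(h_vec * rotation - t_vec)

# Illustrative complex embeddings with d = 4.
rng = np.random.default_rng(0)
h_vec = rng.normal(size=4) + 1j * rng.normal(size=4)
t_vec = rng.normal(size=4) + 1j * rng.normal(size=4)
r_phase = rng.uniform(0, 2 * np.pi, size=4)
print(rotational_score(h_vec, r_phase, t_vec))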
This group of embedding models uses deep neural networks to learn patterns from the knowledge graph, which is the input data.[5] These models have the generality to distinguish the type of entity and relation, temporal information, path information, and underlying structural information,[19] and they resolve the limitations of distance-based and semantic-matching-based models in representing all the features of a knowledge graph.[1] The use of deep learning for knowledge graph embedding has shown good predictive performance, even though such models are more expensive to train, data-hungry, and often require a pre-trained embedding representation of the knowledge graph coming from a different embedding model.[1][5]
This family of models, instead of using fully connected layers, employs one or more convolutional layers that convolve the input data with low-dimensional filters, capable of embedding complex structures with few parameters by learning nonlinear features.[1][5][19]
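A minimal sketch of a convolutional scorer inspired by this family (loosely following ConvE), assuming PyTorch; the class name, layer sizes, and reshaping choices are illustrative rather than any specific published architecture.

import torch
import torch.nn as nn

class ConvScorer(nn.Module):
    """Reshape the head and relation embeddings into 2D maps, convolve them with
    small filters, project back to the embedding space, and score against the tail."""
    def __init__(self, num_entities, num_relations, dim=200, height=10, width=20):
        super().__init__()
        assert dim == height * width
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        self.height, self.width = height, width
        self.conv = nn.Conv2d(1, 32, kernel_size=3)                    # low-dimensional filters
        self.fc = nn.Linear(32 * (2 * height - 2) * (width - 2), dim)  # back to embedding space

    def forward(self, head, rel, tail):
        # Stack the reshaped head and relation embeddings into a single-channel "image".
        x = torch.cat([self.ent(head).view(-1, 1, self.height, self.width),
                       self.rel(rel).view(-1, 1, self.height, self.width)], dim=2)
        x = torch.relu(self.conv(x)).flatten(start_dim=1)
        x = self.fc(x)                           # shape: (batch, dim)
        return (x * self.ent(tail)).sum(dim=1)   # one plausibility score per triple

# Illustrative usage on a tiny batch of two triples.
model = ConvScorer(num_entities=100, num_relations=10)
scores = model(torch.tensor([0, 1]), torch.tensor([2, 3]), torch.tensor([4, 5]))
print(scores.shape)  # torch.Size([2])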
This family of models uses capsule neural networks to create a more stable representation that is able to recognize a feature in the input without losing spatial information.[5] The network is composed of convolutional layers, but they are organized into capsules, and the overall result of a capsule is sent to a higher capsule decided by a dynamic routing process.[5]
This class of models leverages recurrent neural networks.[5] The advantage of this architecture is that it memorizes a sequence of facts rather than just elaborating single events.[41]
The machine learning task most often used to evaluate the embedding accuracy of the models is link prediction.[1][3][5][6][7][19] Rossi et al.[5] produced an extensive benchmark of the models, and other surveys produce similar results.[3][7][19][26] The benchmark involves five datasets: FB15k,[9] WN18,[9] FB15k-237,[42] WN18RR,[37] and YAGO3-10.[43] More recently, it has been argued that these datasets are far from real-world applications, and that other datasets should be integrated as standard benchmarks.[44]