Kernel values too close to 1 after normalization
Hi Maxin. I want to fit a Gaussian Process regression model to predict the energy of many adsorption models, each consisting of more than 64 atoms (slab + adsorbates). I have already generated a list of graphs representing each adsorption model. The graphs have unweighted edges and labeled nodes. I have attached a pickle file containing a list of 10 such graphs: test_random_graphs.pkl
I'm very interested in your GraphDot code and decided to try it out to compute the marginalized graph kernel. Below is my code, based on your example:
import pickle
import numpy as np
import networkx as nx
from graphdot import Graph
from graphdot.kernel.marginalized import MarginalizedGraphKernel
from graphdot.microkernel import (
    TensorProduct,
    SquareExponential,
    KroneckerDelta,
    Constant
)
with open('test_random_graphs.pkl', 'rb') as f:
    graphs = pickle.load(f)
mapping = {'Ni': 1, 'Pt': 2, 'H': 3, 'C': 4, 'O': 5, 'CO': 6, 'CH': 7, 'OH': 8, 'CHO': 9, 'COH': 10, 'CH2': 11, 'CH3': 12}
# rename the node attribute from 'symbol' to 'category', map it to an
# integer label, and add a dummy 'radius' attribute
for graph in graphs:
    for n in graph.nodes:
        graph.nodes[n]['category'] = graph.nodes[n].pop('symbol')
        graph.nodes[n]['category'] = mapping[graph.nodes[n]['category']]
        graph.nodes[n]['radius'] = 1.0
# define the node and edge microkernels
knode = TensorProduct(radius=SquareExponential(0.5),
                      category=KroneckerDelta(0.5))
kedge = Constant(1.0)
# compose the marginalized graph kernel and compute pairwise similarity
mgk = MarginalizedGraphKernel(knode, kedge, q=0.05)
R = mgk([Graph.from_networkx(g) for g in graphs])
# normalize the similarity matrix so that the diagonal is 1
d = np.diag(R)**-0.5
K = np.diag(d).dot(R).dot(np.diag(d))
print(R)
print(K)
Output:
# R
[[10848.06054688 10181.26367188 9729.15820312 9808.81640625
8921.171875 9630.64648438 10301.10742188 10307.16503906
10580.24609375 9791.9140625 ]
[10181.26367188 9953.25097656 9786.89257812 9789.01660156
9469.56640625 9744.18652344 9989.49316406 9999.171875
10057.87109375 9800.08398438]
[ 9729.15820312 9786.89257812 9849.06152344 9785.03125
9867.24609375 9849.73046875 9778.36230469 9781.27734375
9682.53417969 9818.77050781]
[ 9808.81640625 9789.01660156 9785.03125 9752.86132812
9720.50585938 9777.7109375 9791.77246094 9802.02539062
9751.39453125 9775.53125 ]
[ 8921.171875 9469.56640625 9867.24609375 9720.50585938
10590.05078125 9926.23632812 9364.8515625 9422.70605469
9058.00976562 9809.98242188]
[ 9630.64648438 9744.18652344 9849.73046875 9777.7109375
9926.23632812 9885.59570312 9725.6640625 9712.53125
9585.69335938 9806.24023438]
[10301.10742188 9989.49316406 9778.36230469 9791.77246094
9364.8515625 9725.6640625 10048.13476562 10050.2890625
10149.59570312 9798.29101562]
[10307.16503906 9999.171875 9781.27734375 9802.02539062
9422.70605469 9712.53125 10050.2890625 10100.45507812
10183.31835938 9818.92382812]
[10580.24609375 10057.87109375 9682.53417969 9751.39453125
9058.00976562 9585.69335938 10149.59570312 10183.31835938
10396.46484375 9746.42089844]
[ 9791.9140625 9800.08398438 9818.77050781 9775.53125
9809.98242188 9806.24023438 9798.29101562 9818.92382812
9746.42089844 9813.734375 ]]
# K
[[1. 0.97981291 0.94124307 0.9536182 0.83233247 0.92998934
0.98665454 0.98467454 0.99626968 0.94901788]
[0.97981291 1. 0.98847324 0.99355181 0.92235604 0.98233968
0.9988912 0.99726617 0.9887369 0.99158551]
[0.94124307 0.98847324 1. 0.99838656 0.96616102 0.99821823
0.98293761 0.98068081 0.95686081 0.99871721]
[0.9536182 0.99355181 0.99838656 1. 0.95647543 0.99579455
0.98912801 0.98759601 0.96840677 0.99921097]
[0.83233247 0.92235604 0.96616102 0.95647543 1. 0.97013945
0.90783962 0.91107934 0.86325861 0.96228133]
[0.92998934 0.98233968 0.99821823 0.99579455 0.97013945 1.
0.97583219 0.97198718 0.9455386 0.99559786]
[0.98665454 0.9988912 0.98293761 0.98912801 0.90783962 0.97583219
1. 0.99762043 0.99303179 0.98671206]
[0.98467454 0.99726617 0.98068081 0.98759601 0.91107934 0.97198718
0.99762043 1. 0.9937474 0.9862256 ]
[0.99626968 0.9887369 0.95686081 0.96840677 0.86325861 0.9455386
0.99303179 0.9937474 1. 0.96490636]
[0.94901788 0.99158551 0.99871721 0.99921097 0.96228133 0.99559786
0.98671206 0.9862256 0.96490636 1. ]]
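By the way, my normalization step is just the standard cosine normalization of a Gram matrix. It can also be written with broadcasting, which I sanity-checked on a made-up 2x2 matrix (the values below are illustrative, not from my data):

```python
import numpy as np

# toy 2x2 Gram matrix standing in for my 10x10 R above
R = np.array([[4.0, 2.0],
              [2.0, 9.0]])

# cosine normalization, equivalent to np.diag(d).dot(R).dot(np.diag(d))
d = np.diag(R) ** -0.5
K = R * d[:, None] * d[None, :]
# diagonal of K is exactly 1; off-diagonal is 2 / (2 * 3) = 1/3
```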
As you can see, the values in the normalized kernel are all very close to 1, which would suggest the graphs are nearly identical; that isn't the case. Is this behavior normal? If not, how can I fix it? Could I use a MinMaxScaler instead of the diagonal normalization? Also, is there a way to assign higher importance to certain edges than to others? For example, I want to emphasize the edges (bonds) between the adsorbates and the catalyst: if two graphs share the same catalyst but carry different adsorbates, the difference may be only a single node and a single edge, yet the similarity should not be so close to 1. Is that possible?
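To make the edge-weighting question concrete, here is a rough sketch of what I have in mind. The attribute name `interface` and the helper `label_interface_edges` are hypothetical (I have not tried this with GraphDot yet); the idea is just to tag each bond by whether it crosses the adsorbate/catalyst boundary:

```python
import networkx as nx

# categories from the mapping above: 1-2 are catalyst metals, 3-12 adsorbates
ADSORBATE_CATEGORIES = set(range(3, 13))

def label_interface_edges(graph):
    """Tag each edge with interface=1 if it connects an adsorbate
    node to a catalyst node, else interface=0."""
    for u, v in graph.edges:
        cu = graph.nodes[u]['category']
        cv = graph.nodes[v]['category']
        graph.edges[u, v]['interface'] = int(
            (cu in ADSORBATE_CATEGORIES) != (cv in ADSORBATE_CATEGORIES)
        )
    return graph

# toy example: a Pt-Pt bond (catalyst) and a Pt-H bond (interface)
g = nx.Graph()
g.add_node(0, category=2)   # Pt
g.add_node(1, category=2)   # Pt
g.add_node(2, category=3)   # H
g.add_edge(0, 1)
g.add_edge(1, 2)
label_interface_edges(g)
# an edge microkernel could then distinguish the two classes, e.g.
# kedge = TensorProduct(interface=KroneckerDelta(0.3))
```

Would something along these lines work with MarginalizedGraphKernel, or is there a built-in mechanism for this?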
Thanks in advance!