Introduction to QMOF, and converting MOFs to graph data with MatDeepLearn, with notes on pymatgen and VASP

Reposted article. 2022-03-09 13:18. Tags: gnn, mofs

QMOF

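QMOF is one of the most comprehensive MOF databases currently available, and it can be browsed on the Materials Project platform: https://next-gen.materialsproject.org/

At the time of writing, MatDeepLearn is the first GNN baseline framework for materials chemistry that includes QMOF. vxfung/MatDeepLearn: MatDeepLearn, package for graph neural networks in materials chemistry (github.com)

This article first summarizes the data standards described in the QMOF documentation (Version History - Materials Project Documentation), then analyzes how the code (qmof: https://github.com/arosen93/QMOF) extracts graph data from MOF structures, and finally analyzes the graph data standard of the MatDL framework.

Follow-up work includes downloading more data from the CSD and other datasets, understanding the structure and content of CIF files, and defining our own standard for MOF structure graph data.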

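QMOF data sources

Screening criteria: to keep the data DFT-ready, QMOF removes duplicates and invalid structures; to keep the DFT calculations affordable, structures are restricted to fewer than about 300 atoms.

Data sources: nine databases including the CSD and CoRE (Structure Sources - Materials Project Documentation), updated to 20,376 entries as of 2021/12/8.

Brief notes on the other databases:

Pyrene MOFs: experimentally characterized pyrene-containing MOFs. Available on Materials Cloud.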


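(All of the following are hypothetical MOFs, hMOFs)

ToBaCCo: hMOFs generated from building blocks with the Topology-Based Crystal Constructor (ToBaCCo) code.

Anderson and Gómez-Gualdrón dataset: hMOFs generated with ToBaCCo.

Woo & Boyd et al.: hMOFs generated with ToBaCCo. Available on Materials Cloud.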

Genomic MOF Database: hMOFs, available on Figshare

Hypothetical MOF-74s: hMOFs

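Standard fields in QMOF

QMOF-ID: "qmof-" followed by seven hexadecimal characters, e.g. qmof-1234567. Different IDs correspond to different primitive unit cells; whether two structures are different is decided with Pymatgen's StructureMatcher.

MOF-ID: a SMILES-based expression (SMILES describes a chemical structure as a string of characters; see the Baidu Baike entry, baidu.com), used to look up MOFs (a DOI or CSD refcode can also be used). It is generated by a SMILES-based algorithm.

Topology: number of vertices, number of edges, connectivity of the building units, and so on. Obtained by looking up the MOF-ID in the Reticular Chemistry Structure Resource, a site that contains several thousand topologies.

The following are quantum-mechanical properties (labels):

Band gap: obtained with Pymatgen's EIGENVAL parser.

Partial atomic charges, multiple magnetic properties, bond orders, density of states, and pore geometry properties.

VASP settings: the DFT parameters.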

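Converting CIF files to graph data

This method uses the pymatgen and ASE libraries to parse CIF files and obtain the atomic structure of a MOF (including neighbor and spatial information). Based on that structure, bonds are defined by a distance cutoff, and the graph is then built by keeping the n nearest neighbors.

Output

Two adjacency tables representing the connections between the atoms of a single MOF (neighbor ids are stored in one table and edge attributes in the other; keeping only the nearest neighbors guarantees the same number of columns in every row).
A feature vector for each node.
The label and id of the whole graph.

Input

The data folder root_dir has the following structure: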

root_dir
├── id_prop.csv
├── atom_init.json
├── id0.cif
├── id1.cif
├── ...

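Here id0.cif, id1.cif, ... are the CIF files.

CIF syntax is fairly complex, and a complete CIF file contains many items. The meaning of each item is explained in "CIF File Explained" on Baidu Wenku (baidu.com) and in the official IUCr document "A guide to CIF for authors".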

data_image0
_cell_length_a 16.991
_cell_length_b 16.991
_cell_length_c 16.991
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90

_symmetry_space_group_name_H-M "P 1"
_symmetry_int_tables_number 1

loop_
_symmetry_equiv_pos_as_xyz
'x, y, z'

loop_
_atom_site_label
_atom_site_occupancy
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
_atom_site_thermal_displace_type
_atom_site_B_iso_or_equiv
_atom_site_type_symbol
C1 1.0000 0.37705 0.00790 0.62295 Biso 1.000 C
C2 1.0000 0.36851 0.89914 0.68750 Biso 1.000 C
H1 1.0000 0.37770 0.85890 0.72340 Biso 1.000 H
C3 1.0000 0.40610 0.08550 0.59390 Biso 1.000 C
N1 1.0000 0.40973 0.96832 0.68278 Biso 1.000 N
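
To make the atom-site loop above concrete, here is a minimal sketch that parses the five atom rows and converts their fractional coordinates to Cartesian ones. It assumes the column order of this particular excerpt and an orthogonal cell (all angles 90 degrees, as above), in which case Cartesian = fractional × cell length. The helper name parse_atom_sites is hypothetical; real CIF files should be read with pymatgen or ASE rather than by hand.

```python
# Minimal sketch: parse the atom-site rows of the CIF excerpt above.
# Assumes the column order of that excerpt and an orthogonal cell.
CELL = (16.991, 16.991, 16.991)  # _cell_length_a/b/c from the excerpt

ATOM_ROWS = """\
C1 1.0000 0.37705 0.00790 0.62295 Biso 1.000 C
C2 1.0000 0.36851 0.89914 0.68750 Biso 1.000 C
H1 1.0000 0.37770 0.85890 0.72340 Biso 1.000 H
C3 1.0000 0.40610 0.08550 0.59390 Biso 1.000 C
N1 1.0000 0.40973 0.96832 0.68278 Biso 1.000 N
"""


def parse_atom_sites(text, cell):
    """Return one dict per atom: label, element, fractional and Cartesian coords."""
    sites = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip anything that is not a full atom row
        label, _occ, fx, fy, fz, _disp, _biso, symbol = fields
        frac = (float(fx), float(fy), float(fz))
        # Orthogonal cell: scale each fractional coordinate by its cell length
        cart = tuple(f * length for f, length in zip(frac, cell))
        sites.append({"label": label, "element": symbol, "frac": frac, "cart": cart})
    return sites


sites = parse_atom_sites(ATOM_ROWS, CELL)
for s in sites:
    print(s["label"], s["element"], [round(c, 3) for c in s["cart"]])
```

For non-orthogonal cells the conversion needs the full lattice matrix, which is one reason to leave the parsing to a library.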

The CSV file maps each CIF id to its property value.

1,1.0
2,2.0
3,3.0

The JSON file contains the initial embedding vector of each element (randomly initialized, or one-hot).

"1": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
"2": ...

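Graph construction procedure

Get all atoms: Structure.from_file("id0.cif") reads the CIF and converts it to a pymatgen Structure.
Get the atom feature vectors (initial embeddings) atom_fea: AtomCustomJSONInitializer.get_atom_fea
Get the neighbors of each atom, all_nbrs: Structure.get_all_neighbors
Build the graph to obtain nbr_id and nbr_fea: create edges using radius as the cutoff and keep the nearest max_num_nbr. If an atom has fewer neighbors than that, its row is padded to the full width (the original note calls this a normalization and marks it as uncertain). Finally, GaussianDistance.expand is called on the distances.

Summary: bonds are assigned by a distance cutoff and then the n nearest neighbors are kept (nbr_fea can also be viewed as bond features). The attributes include some hand-crafted distance features (padding/normalization, Gaussian distance expansion).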
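
The procedure above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the QMOF code: it ignores periodic images (which Structure.get_all_neighbors does handle), pads short rows with index 0 and distance radius + 1 (a CGCNN-style convention assumed here), and uses a Gaussian basis exp(-(d - c)^2 / var^2) that is likewise an assumption about what GaussianDistance.expand computes. The function names build_neighbor_tables and gaussian_expand are hypothetical.

```python
import math


def build_neighbor_tables(coords, radius=8.0, max_num_nbr=3):
    """Return (nbr_id, nbr_dist): fixed-width tables of neighbor indices and distances."""
    nbr_id, nbr_dist = [], []
    for i, ci in enumerate(coords):
        # Collect every other atom within the cutoff radius (no periodic images here)
        nbrs = []
        for j, cj in enumerate(coords):
            if i == j:
                continue
            d = math.dist(ci, cj)
            if d <= radius:
                nbrs.append((d, j))
        nbrs.sort()                 # nearest first
        nbrs = nbrs[:max_num_nbr]   # keep at most max_num_nbr neighbors
        pad = max_num_nbr - len(nbrs)
        # Assumed padding convention: index 0 and a distance just beyond the cutoff
        nbr_id.append([j for _, j in nbrs] + [0] * pad)
        nbr_dist.append([d for d, _ in nbrs] + [radius + 1.0] * pad)
    return nbr_id, nbr_dist


def gaussian_expand(distance, dmin=0.0, dmax=8.0, step=2.0, var=None):
    """Expand a scalar distance on Gaussians centered at dmin, dmin + step, ..., dmax."""
    var = step if var is None else var
    n_centers = int(round((dmax - dmin) / step)) + 1
    centers = [dmin + k * step for k in range(n_centers)]
    return [math.exp(-((distance - c) ** 2) / var ** 2) for c in centers]


# Four atoms; the last one is farther than the cutoff from all the others
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (20.0, 0.0, 0.0)]
nbr_id, nbr_dist = build_neighbor_tables(coords, radius=8.0, max_num_nbr=3)
nbr_fea = [[gaussian_expand(d) for d in row] for row in nbr_dist]

print(nbr_id)
print(nbr_dist)
```

Every row of nbr_id and nbr_dist has exactly max_num_nbr columns, which is what lets the two tables be stored as dense arrays.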

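Possible improvements: additionally consider the three-dimensional features of bonds (as in ALIGNN) and subgraph features (functional groups, secondary structures).

Converting CSD CIFs to MOF CIFs

This is the workflow that processes raw CIF files from the Cambridge Structural Database into the files of the QMOF database. The goal is to remove distorted structures (maximize structural fidelity), which requires chemical knowledge.

Workflow

1. Normalize the CIF files with the ASE library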

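import os
from ase.io import read, write

# cif_path is the folder holding the CIF files; cifs is the list of file names in it
for cif in cifs:
    # Round-trip each file through ASE so it is rewritten in a consistent format
    mof = read(os.path.join(cif_path, cif))
    write(os.path.join(cif_path, cif), mof)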

2. Use pymatgen to obtain the smallest repeating unit of the crystal (the primitive cell)

structure = Structure.from_file(os.path.join(folder,entry),primitive=True)

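3. Remove invalid CIF files; the criteria are dist, lone, duplicate, and oxo

4. Conversion to and from XYZ files is supported

For example, an ASE-formatted appended XYZ file can be converted to a folder of CIFs.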
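
Step 3 names its criteria without defining them. The sketch below shows what two of them could look like, under the assumption that "dist" flags structures with atoms that are unrealistically close together and "lone" flags structures containing an isolated atom. Both that interpretation and the thresholds (0.75 Å and 2.5 Å) are hypothetical and not taken from QMOF; the functions also ignore periodic images for brevity.

```python
import math


def min_interatomic_distance(coords):
    """Smallest distance between any two atoms (non-periodic, for illustration)."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    )


def has_lone_atom(coords, bond_cutoff):
    """True if some atom has no other atom within bond_cutoff."""
    for i, a in enumerate(coords):
        neighbors = (b for j, b in enumerate(coords) if j != i)
        if not any(math.dist(a, b) <= bond_cutoff for b in neighbors):
            return True
    return False


def passes_checks(coords, min_dist=0.75, bond_cutoff=2.5):
    """Hypothetical thresholds in Angstroms; not the values used by QMOF."""
    too_close = min_interatomic_distance(coords) < min_dist
    return not too_close and not has_lone_atom(coords, bond_cutoff)


good = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0)]
overlapping = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (1.5, 0.0, 0.0)]
isolated = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (9.0, 0.0, 0.0)]

for name, coords in [("good", good), ("overlapping", overlapping), ("isolated", isolated)]:
    print(name, passes_checks(coords))
```

A real filter would use element-dependent thresholds and periodic neighbor searches, which is where the chemical knowledge mentioned above comes in.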

MatDL graph construction standard

MatDL uses the PyTorch Geometric package to implement its GNN algorithms.

It reads QMOF data in JSON format, organized as follows:

├── dictionary_source.json
├── root_dir
│   ├── targets.csv
│   ├── qmof-1234567.json
│   ├── qmof-89abcde.json
│   ├── ...

The JSON format is:

{"1": {
 "cell": {"array": {"__ndarray__": [[3, 3], "float64", [9.806075328, 0.108025938, 8.4506e-05, -2.412034763, 9.564903313, -3.8425e-05, 0.000113935, -2.188e-05, 13.246085401]]}, "__ase_objtype__": "cell"},
 "ctime": 22.202870969564355,
 "mtime": 22.202870969564355,
 "numbers": {"__ndarray__": [[108], "int32", [27, 27, 27, 27, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]]},
 "pbc": {"__ndarray__": [[3], "bool", [true, true, true]]},
 "positions": {"__ndarray__": [[108, 3], "float64", [9.000000002061375e-05, -1.0000000015167157e-05, 6.623059999999954, 3.6971139349999804, 4.836458119999971, 13.246105400999976, ...,]]},
 "unique_id": "fc8de825d5b9a6edc082f11ef7e5db52",
 "user": null},
"ids": [1],
"nextid": 2}

dataloader

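(Data loading in PyG consists of a dataset class, with an in-memory variant for small datasets, and the corresponding DataLoader. The core is the dataset's process function. dictionary_source holds the one-hot encodings of the elements.)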

Data(edge_index=[2, 507], edge_weight=[507], y=0.6325269937515259,
z=[39], u=[1, 3], structure_id=[1], x=[39, 114], edge_attr=[507, 50])
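
Before a Data object like the one above can be assembled, the "__ndarray__" entries of the JSON record have to be turned back into arrays. The following is a minimal sketch of that decoding using only the standard library. The helper name decode_ndarray is hypothetical, and the [shape, dtype, flat values] layout is read off the example record; in practice ASE's own JSON I/O takes care of this.

```python
import json


def decode_ndarray(obj):
    """Turn {"__ndarray__": [shape, dtype, flat_values]} into nested lists."""
    shape, _dtype, flat = obj["__ndarray__"]

    def reshape(values, dims):
        if len(dims) == 1:
            return list(values)
        chunk = len(values) // dims[0]  # number of values per leading index
        return [reshape(values[k * chunk:(k + 1) * chunk], dims[1:])
                for k in range(dims[0])]

    return reshape(flat, shape)


# "cell" and "pbc" entries copied from the example record
record = json.loads("""
{"cell": {"array": {"__ndarray__": [[3, 3], "float64",
   [9.806075328, 0.108025938, 8.4506e-05,
    -2.412034763, 9.564903313, -3.8425e-05,
    0.000113935, -2.188e-05, 13.246085401]]}, "__ase_objtype__": "cell"},
 "pbc": {"__ndarray__": [[3], "bool", [true, true, true]]}}
""")

cell = decode_ndarray(record["cell"]["array"])  # 3 x 3 lattice vectors
pbc = decode_ndarray(record["pbc"])             # periodicity along each axis

for vector in cell:
    print(vector)
print(pbc)
```

The same function applies to "numbers" (shape [108]) and "positions" (shape [108, 3]) in the full record.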

Main attributes

edge_index, edge_weight: the sparse adjacency matrix, obtained directly by matrix operations on the result of ase_crystal.get_all_distances:

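import torch
from torch_geometric.utils import dense_to_sparse

## Obtain distance matrix with ase (mic=True applies the minimum-image convention)
distance_matrix = ase_crystal.get_all_distances(mic=True)

## Create sparse graph from distance matrix
## (threshold_sort is a MatDeepLearn helper; ase_crystal and processing_args come from the caller)
distance_matrix_trimmed = threshold_sort(
    distance_matrix,
    processing_args["graph_max_radius"],
    processing_args["graph_max_neighbors"],
    adj=False,
)

distance_matrix_trimmed = torch.Tensor(distance_matrix_trimmed)
## dense_to_sparse returns (edge_index, edge_attr) for the nonzero entries
out = dense_to_sparse(distance_matrix_trimmed)
edge_index = out[0]
edge_weight = out[1]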
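
To show what the trimming and the dense-to-sparse conversion amount to, here is a dependency-free sketch. It assumes that the trimming step zeroes every entry that is beyond the radius or not among the nearest max_neighbors of its row, which is an assumption about threshold_sort rather than a copy of it. The names trim_distance_matrix and dense_to_coo are hypothetical.

```python
def trim_distance_matrix(dist, max_radius, max_neighbors):
    """Zero out entries beyond max_radius or outside each row's nearest max_neighbors."""
    n = len(dist)
    trimmed = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(dist):
        # Candidate neighbors: other atoms within the cutoff, nearest first
        candidates = sorted(
            (d, j) for j, d in enumerate(row) if j != i and 0.0 < d <= max_radius
        )
        for d, j in candidates[:max_neighbors]:
            trimmed[i][j] = d
    return trimmed


def dense_to_coo(matrix):
    """Return (edge_index, edge_weight) listing the nonzero entries row by row."""
    sources, targets, weights = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0.0:
                sources.append(i)
                targets.append(j)
                weights.append(value)
    return [sources, targets], weights


# Symmetric distance matrix for four atoms; the last one is far from the rest
dist = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.5, 8.0],
    [2.0, 1.5, 0.0, 7.0],
    [9.0, 8.0, 7.0, 0.0],
]
trimmed = trim_distance_matrix(dist, max_radius=5.0, max_neighbors=1)
edge_index, edge_weight = dense_to_coo(trimmed)

print(edge_index)
print(edge_weight)
```

Note that keeping the nearest neighbors per row makes the graph directed: here atom 2 points to atom 1, but atom 1 points to atom 0. Zero entries are dropped, so the diagonal never produces self-loops.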

x: the node embedding vectors; the initial value is the stack of one-hot vectors of the atoms in the MOF.
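
As a small illustration of such a stack, the sketch below builds one one-hot row per atom. It uses a five-element vocabulary (H, C, N, O, Co) chosen for the example rather than the 114-column encoding seen in the Data object above, and the function name one_hot_atoms is hypothetical.

```python
def one_hot_atoms(atomic_numbers, species):
    """Stack one one-hot row per atom; columns follow the order of `species`."""
    column = {z: k for k, z in enumerate(species)}
    rows = []
    for z in atomic_numbers:
        row = [0] * len(species)
        row[column[z]] = 1
        rows.append(row)
    return rows


species = [1, 6, 7, 8, 27]                    # H, C, N, O, Co
x = one_hot_atoms([27, 1, 6, 7, 8], species)  # one row per atom

for row in x:
    print(row)
```

With this layout, x has shape [number of atoms, size of the vocabulary], matching the [39, 114] shape reported for x in the Data object.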

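y: the label, a one-dimensional tensor.

Auxiliary information

structure_id: the QMOF ID.

u: apparently a placeholder for a state feature (the original note marks this as uncertain); its shape is [1, 3] in the example above, i.e. 3*n values for n graphs.

z: the atomic numbers of the atoms, one entry per atom, so its length equals the total number of atoms (39 in the example above).

Additional descriptors

edge_voronoi, SOAP, SM, edge_descriptor

Attributes of the graph dataset: length, species (may be used by the descriptors).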

Overview of the chemistry libraries

pymatgen

https://pymatgen.org/, a Python library for materials analysis.

conda install --channel conda-forge pymatgen

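In QMOF it is mainly used to read CIF files, represent them with its Structure class, and build graphs.

Core classes

The pymatgen.core package provides the data structures for molecules and crystal structures. The core classes include:

Species: an element in a particular state (for example, with an oxidation state).

Composition: a mapping from elements to their amounts, {element: amount}.

Site: a composition together with its position in space and optional properties (such as magnetic moment).

PeriodicSite: a Site that is associated with a Lattice.

Lattice: the crystal lattice that defines the periodic cell.

Molecule and Structure (periodic): arrays of Site and PeriodicSite objects, respectively.

Attributes include occupancies and angles; lengths are in Angstroms and angles are in degrees.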

ASE

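According to the original post, ASE is installed together with pymatgen by conda install --channel conda-forge pymatgen; if it is missing, it can be installed separately with conda install --channel conda-forge ase.

ASE (Atomic Simulation Environment) is written in Python and is designed for setting up, steering, and analyzing atomistic simulations. In QMOF it is mainly used to read CIF files.

VASP

VASP is a software package for electronic-structure calculations and quantum-mechanical molecular dynamics, developed by the Hafner group at the University of Vienna. It is one of the most popular commercial packages in materials simulation and computational materials science, and users must hold a valid license. Once installed, it can be used to compute the properties of MOF structures.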

# Load the compiler
module load intel/2017.1
tar xf vasp.5.4.4.tar.gz
cd vasp.5.4.4
cp arch/makefile.include.linux_intel ./makefile.include
make all
# After the build, three executables (vasp_gam, vasp_ncl, vasp_std) are created in the bin folder under vasp.5.4.4.

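# Building the GPU version
module load intel/2017.1
tar xf vasp.5.4.4.tar.gz
cd vasp.5.4.4
cp arch/makefile.include.linux_intel makefile.include
# Edit makefile.include: change -openmp to -qopenmp
make gpu
# CUDA must be loaded when running the GPU version; newer Intel compilers such as 2018 fail during compilation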

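VASP references: Vasp · Doc (pku.edu.cn); ScienceNet: "Four ways to fix selected atoms in a VASP POSCAR", blog post by Guo Lingju (sciencenet.cn)

Summary

The input files of QMOF's GCN and of MatDL are similar; the former uses CIF and the latter JSON, and both can be handled with the ASE library.

MatDL is built on the PyG framework, constructs graphs with matrix operations, and adds many features that allow it to be extended to multiple GNN algorithms. QMOF's model is a simple GCN implemented in plain PyTorch.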
