Acta Metall Sin  2024, Vol. 60 Issue (10): 1429-1438    DOI: 10.11900/0412.1961.2024.00197
Research paper
Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys
LIU Yue1, YAO Wenxuan1, LIU Dahui1, DING Lin1, YANG Zhengwei1, LIU Wei2, YU Tao3, SHI Siqi2,4
1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2 Materials Genome Institute, Shanghai University, Shanghai 200444, China
3 Division of Functional Materials, Central Iron and Steel Research Institute, Beijing 100081, China
4 School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
Cite this article: 

LIU Yue, YAO Wenxuan, LIU Dahui, DING Lin, YANG Zhengwei, LIU Wei, YU Tao, SHI Siqi. Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys. Acta Metall Sin, 2024, 60(10): 1429-1438.

Abstract  

Knowledge of the structure-activity relationships of nickel-based single crystal superalloys is stored mainly as unstructured text in the vast published scientific literature, and using it effectively can accelerate the design of high-performance materials. Named entity recognition (NER) methods can extract key information from unstructured text and thereby automate tedious text mining tasks. However, existing NER methods, especially deep-learning-based ones, typically rely on large corpora and can hardly handle cross-domain tasks. To the best of our knowledge, no prior research has applied deep-learning-based NER to knowledge discovery for nickel-based single crystal superalloys; thus, it is difficult to adapt existing methods to this field. Here, a semantic-features-fused NER (SF-NER) method based on deep learning is proposed to accurately extract knowledge from abstracts concerning nickel-based single crystal superalloys. Specifically, because data quality determines the performance of NER models, a high-quality annotated corpus for these superalloys (containing 19405 entities of eight entity types) was constructed via remote supervision, using a domain-specific materials dictionary under the guidance of domain knowledge. To accurately capture material-specific terms from this corpus, an encoding fusion strategy for word representation was proposed that encodes the essential semantic features of materials from multiple perspectives. Then, based on these semantic features, a deep learning model, i.e., bidirectional long short-term memory-conditional random field (Bi-LSTM-CRF), was built to capture key semantic information in sentence sequences and thus accurately predict entity types.
The experimental results demonstrate that the proposed SF-NER method can accurately distinguish the entity categories of nickel-based single crystal superalloys (F1 = 0.84) and effectively identify the key factors influencing their service performance. Lastly, descriptors with high importance are recommended; they can be employed in machine learning modeling to explore the structure-activity relationships of high-performance materials.
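
The dictionary-driven annotation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary contents, the entity type names `MAT` and `PRO`, the function name `bio_label`, and the greedy longest-match rule are all assumptions made for illustration (only the B/I/O scheme and the `PRP` property type appear in the figure captions of the paper).

```python
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: lowercase token tuples mapped to entity types.
# MAT and PRO are illustrative type names, not taken from the paper.
DOMAIN_DICT: Dict[Tuple[str, ...], str] = {
    ("nickel-based", "single", "crystal", "superalloy"): "MAT",
    ("creep", "life"): "PRP",
    ("hot", "isostatic", "pressing"): "PRO",
}


def bio_label(tokens: List[str],
              dictionary: Dict[Tuple[str, ...], str] = DOMAIN_DICT) -> List[str]:
    """Assign B-/I-/O tags by greedy longest-match lookup in a term dictionary."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max(len(term) for term in dictionary)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so long terms beat their prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = dictionary.get(tuple(lowered[i:i + n]))
            if etype is not None:
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{etype}"
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no dictionary term starts here; token stays "O"
    return labels


sentence = "Hot isostatic pressing improves the creep life".split()
for token, label in zip(sentence, bio_label(sentence)):
    print(f"{token}\t{label}")
```

Labels produced this way are only as good as the dictionary, which is presumably why the abstract stresses domain knowledge guidance during corpus construction.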

Key words: data quality; deep learning; named entity recognition; nickel-based single crystal superalloy; domain knowledge
Received:  11 June 2024     
ZTFLH:  TG131  
Fund: National Natural Science Foundation of China (52073169, 92270124); National Key Research and Development Program of China (2021YFB3802101)
Corresponding Authors:  SHI Siqi, professor, Tel: 15800543880, E-mail: sqshi@shu.edu.cn

URL: 

https://www.ams.org.cn/EN/10.11900/0412.1961.2024.00197     OR     https://www.ams.org.cn/EN/Y2024/V60/I10/1429

Fig.1  Diagram for literature mining and field application of nickel-based single crystal superalloys using the SF-NER (BPE—byte-pair encoding, NER—named entity recognition, ML—machine learning, CRF—conditional random field, B—the beginning of an entity (Begin), I—the inside of an entity (Inside), O—outside of an entity (Outside))
Fig.2  Material named entity recognition framework for Bi-LSTM-CRF (Bi-LSTM—bi-directional long short term memory, PRP—property)
Model                        Precision   Recall   F1-score
BERT-Bi-GRU-CRF              0.57        0.62     0.60
Bi-LSTM(Glove)-CRF           0.80        0.81     0.80
Bi-LSTM(OneHot)-CRF          0.81        0.81     0.81
Bi-LSTM(OneHot-Glove)-CRF    0.82        0.83     0.82
SF-NER                       0.84        0.84     0.84
Table 1  Precision, recall, and F1-score of different models on dataset A_DomainDictionary
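
For readers who want to relate the three columns of Table 1, the F1-score is conventionally the harmonic mean of precision and recall. The short sketch below assumes that standard definition (the page itself does not state the formula) and uses a hypothetical helper name, `f1_score`. Because the table reports values rounded to two decimals, recomputing F1 from the rounded precision and recall may differ from the reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SF-NER row of Table 1: precision 0.84, recall 0.84
print(round(f1_score(0.84, 0.84), 2))
```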
Fig.3  Ten-fold cross-validation results of the SF-NER on dataset A_DomainDictionary
(a) performance of SF-NER during ten-fold cross-validation
(b) number of word entities recognized by SF-NER during ten-fold cross-validation
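
The ten-fold cross-validation protocol behind Fig.3 can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' evaluation code: the function name `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are hypothetical choices, and the sample count is a toy value.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_samples=25, k=10)
for fold_id, (train, test) in enumerate(splits, start=1):
    # A model would be trained on `train` and scored on `test` in each round.
    print(fold_id, len(train), len(test))
```

Averaging the per-fold scores gives a single estimate, while their spread indicates how sensitive the model is to the particular split.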
Model                        A_ManualLabeling   A_DomainDictionary
BERT-Bi-GRU-CRF              0.44               0.60
Bi-LSTM(Glove)-CRF           0.75               0.80
Bi-LSTM(OneHot-BPE)-CRF      0.78               0.84
Table 2  Performance (F1-score) of different models on two datasets
Fig.4  Precision, recall, and F1-score of the SF-NER model for different-type entities
Fig.5  Recommended descriptor importance ranking and entities already focused on in materials machine learning (partial display) (HIP—hot isostatic pressing, TCP—topologically close-packed phases, EB-PVD—electron beam-physical vapor deposition, SRZ—secondary reaction zone, IDZ—interdiffusion zone)