Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys
LIU Yue1, YAO Wenxuan1, LIU Dahui1, DING Lin1, YANG Zhengwei1, LIU Wei2, YU Tao3, SHI Siqi2,4
1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2 Materials Genome Institute, Shanghai University, Shanghai 200444, China
3 Division of Functional Materials, Central Iron and Steel Research Institute, Beijing 100081, China
4 School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
Cite this article:
LIU Yue, YAO Wenxuan, LIU Dahui, DING Lin, YANG Zhengwei, LIU Wei, YU Tao, SHI Siqi. Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys. Acta Metall Sin, 2024, 60(10): 1429-1438.
Abstract Knowledge of the structure-activity relationships of nickel-based single crystal superalloys is stored mainly as unstructured text in the vast body of published scientific literature, and using it effectively can accelerate the design of high-performance materials. Named entity recognition (NER) methods extract key information from unstructured text and can therefore automate tedious text mining tasks. However, existing NER methods, especially deep-learning-based ones, typically rely on large corpora and transfer poorly across domains. To the best of our knowledge, deep-learning-based NER has not previously been applied to knowledge discovery for nickel-based single crystal superalloys, so existing methods are difficult to adapt to this field. Here, a semantic-features-fused NER (SF-NER) method based on deep learning is proposed to accurately extract knowledge from the abstracts of literature on nickel-based single crystal superalloys. Specifically, because data quality determines the performance of NER models, a high-quality annotated corpus for these superalloys (containing 19405 entities of eight entity types) was constructed via remote supervision, using a domain-specific materials dictionary under the guidance of domain knowledge. To accurately capture material-specific terms from this corpus, an encoding fusion strategy for word representation was proposed that encodes the essential semantic features of materials from multiple perspectives. Based on these semantic features, a deep learning model, bidirectional long short-term memory-conditional random field (Bi-LSTM-CRF), was then built to capture key semantic information in sentence sequences and thus accurately predict entity types.
Experimental results demonstrate that the proposed SF-NER method accurately distinguishes the entity categories of nickel-based single crystal superalloys (F1 = 0.84) and effectively identifies the key factors influencing their service performance. Finally, descriptors of high importance are recommended for use in machine learning models that explore the structure-activity relationships of high-performance materials.
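The dictionary-driven remote supervision step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bio_label`, the toy dictionary, and the entity type names (`ELEMENT`, `PROPERTY`, `PROCESS`) are assumptions made for illustration, and the greedy longest-match rule and BIO tagging scheme are one common way to turn a term dictionary into token-level labels.

```python
from typing import Dict, List, Tuple


def bio_label(tokens: List[str], dictionary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup against a term dictionary.

    Dictionary keys are lowercased token tuples; values are entity type names.
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(term) for term in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word terms take priority.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in dictionary:
                entity_type = dictionary[span]
                labels[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{entity_type}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical dictionary entries and entity types, for illustration only.
TOY_DICTIONARY = {
    ("re",): "ELEMENT",
    ("creep", "rupture", "life"): "PROPERTY",
    ("solution", "treatment"): "PROCESS",
}

tokens = "Adding Re prolongs the creep rupture life after solution treatment".split()
labels = bio_label(tokens, TOY_DICTIONARY)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Labels produced this way are only as reliable as the dictionary, which is consistent with the abstract's emphasis on domain knowledge guiding corpus construction; a sequence model such as Bi-LSTM-CRF would then be trained on the resulting token-label pairs.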
Received: 11 June 2024
Fund: National Natural Science Foundation of China (52073169, 92270124); National Key Research and Development Program of China (2021YFB3802101)
Corresponding Authors:
SHI Siqi, professor, Tel: 15800543880, E-mail: sqshi@shu.edu.cn