Acta Metallurgica Sinica  2024, Vol. 60 Issue (10): 1429-1438    DOI: 10.11900/0412.1961.2024.00197
Research Paper
Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys
LIU Yue¹, YAO Wenxuan¹, LIU Dahui¹, DING Lin¹, YANG Zhengwei¹, LIU Wei², YU Tao³, SHI Siqi²,⁴
¹ School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
² Materials Genome Institute, Shanghai University, Shanghai 200444, China
³ Division of Functional Materials, Central Iron and Steel Research Institute, Beijing 100081, China
⁴ School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
Cite this article:

Yue LIU, Wenxuan YAO, Dahui LIU, Lin DING, Zhengwei YANG, Wei LIU, Tao YU, Siqi SHI. Named Entity Recognition Driven by High-Quality Text Data Accelerates the Knowledge Discovery of Nickel-Based Single Crystal Superalloys[J]. Acta Metall Sin, 2024, 60(10): 1429-1438.

Abstract

Knowledge about the structure-activity relationships of nickel-based single crystal superalloys is stored mainly as unstructured text in the vast body of published scientific literature, and using it effectively can accelerate the design of high-performance materials. Named entity recognition (NER) methods can extract key information from unstructured text and thereby automate tedious text-mining tasks. However, existing NER methods, especially deep-learning-based ones, typically rely on large corpora and handle cross-domain tasks poorly. To the best of our knowledge, deep-learning-based NER has not previously been applied to knowledge discovery for nickel-based single crystal superalloys, and existing methods are difficult to adapt to this field. Here, a semantic-features-fused NER (SF-NER) method based on deep learning is proposed to accurately extract knowledge about nickel-based single crystal superalloys from abstract text. Because data quality determines the performance of NER models, a high-quality annotated corpus for these superalloys (containing 19405 entities of eight entity types) was constructed via remote supervision, using a domain-specific materials dictionary created under the guidance of domain knowledge. To accurately capture material-specific terms from this corpus, an encoding fusion strategy for word representation was proposed that encodes the essential semantic features of materials from several perspectives. Based on these semantic features, a deep learning model, the bidirectional long short-term memory-conditional random field (Bi-LSTM-CRF), was built to capture key semantic information in sentence sequences and thus accurately predict entity types. Experimental results show that SF-NER accurately distinguishes the entity categories of nickel-based single crystal superalloys (F1 = 0.84) and effectively identifies the key factors influencing their service performance. Finally, descriptors of high importance were recommended for use in machine learning models that explore the structure-activity relationships of high-performance materials.
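As a rough illustration of how dictionary-based remote supervision can produce labeled training data, the sketch below tags tokens with BIO labels by greedy longest-match lookup in a small dictionary. The dictionary entries, the entity type names (`PROPERTY`, `ELEMENT`, `PROCESS`), and the function `distant_label` are hypothetical and chosen only for this example; the abstract does not list the eight entity types or describe the matching rule actually used.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary: lowercase token tuples -> entity type label.
# Entries and type names are illustrative, not taken from the paper.
DOMAIN_DICT: Dict[Tuple[str, ...], str] = {
    ("creep", "rupture", "life"): "PROPERTY",
    ("creep",): "PROPERTY",
    ("rhenium",): "ELEMENT",
    ("solution", "treatment"): "PROCESS",
}
MAX_LEN = max(len(key) for key in DOMAIN_DICT)


def distant_label(tokens: List[str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in DOMAIN_DICT."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = DOMAIN_DICT.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


sentence = "Rhenium improves creep rupture life after solution treatment".split()
labels = distant_label(sentence)
for token, tag in zip(sentence, labels):
    print(f"{token}\t{tag}")
```

Longest-match lookup keeps a multi-word term such as "creep rupture life" together as one entity instead of labeling only its first word. Labels produced this way inherit any gaps or ambiguities in the dictionary, which is one reason a corpus built like this would still need quality control.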

Key words: data quality; deep learning; named entity recognition; nickel-based single crystal superalloy; domain knowledge
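Received: 2024-06-11
Chinese Library Classification (ZTFLH): TG131
Fund: National Natural Science Foundation of China (52073169, 92270124); National Key Research and Development Program of China (2021YFB3802101)
Corresponding author: SHI Siqi, professor, Tel: 15800543880, E-mail: sqshi@shu.edu.cn; research focuses mainly on the computation and design of electrochemical energy storage materials
About the author: LIU Yue, female, born in 1975, Ph.D.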
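Fig.1  Flowchart of literature mining and application for nickel-based single crystal superalloys using the semantic-features-fused deep learning named entity recognition (SF-NER) method
Fig.2  Materials named entity recognition framework based on the bidirectional long short-term memory-conditional random field (Bi-LSTM-CRF) model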
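In a Bi-LSTM-CRF tagger, the CRF layer typically selects the best tag sequence with Viterbi decoding over per-token scores and tag-to-tag transition scores. The sketch below shows that decoding step in plain Python. The three-tag set, the emission scores (standing in for Bi-LSTM outputs), and the transition scores are made-up values for illustration, not parameters from the paper.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][k]: score of tag k at position t (e.g. from a Bi-LSTM).
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backptr: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for k in range(n_tags):
            best_j = max(range(n_tags),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emit[k])
            ptrs.append(best_j)
        score = new_score
        backptr.append(ptrs)
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    return path[::-1]


TAGS = ["O", "B", "I"]  # assumed tag set for this example
# Illustrative scores; the large penalty forbids the invalid move O -> I.
emissions = [[2.0, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.2]]
transitions = [[0.0, 0.0, -100.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
path = viterbi_decode(emissions, transitions)
print([TAGS[k] for k in path])
```

Choosing the best tag independently at each position would give O, I, O here, which is not a valid BIO sequence. Scoring whole sequences lets the transition penalty steer the decoder to O, B, O instead.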
Model                        Precision  Recall  F1-score
BERT-Bi-GRU-CRF              0.57       0.62    0.60
Bi-LSTM(Glove)-CRF           0.80       0.81    0.80
Bi-LSTM(OneHot)-CRF          0.81       0.81    0.81
Bi-LSTM(OneHot-Glove)-CRF    0.82       0.83    0.82
SF-NER                       0.84       0.84    0.84
Table 1  Precision, recall, and F1-score of different models on the A_DomainDictionary dataset
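For readers unfamiliar with these metrics, the sketch below computes entity-level precision, recall, and F1 from gold and predicted BIO tag sequences, counting an entity as correct only when its boundaries and type both match. This exact-match convention and the helper names `extract_spans` and `prf` are assumptions for illustration; the page does not state the evaluation protocol behind the table.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from a BIO tag sequence."""
    spans: Set[Tuple[int, int, str]] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 with exact span matching."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up tag sequences: the second entity is predicted with a wrong boundary.
gold = ["B-ELEMENT", "O", "B-PROPERTY", "I-PROPERTY", "O"]
pred = ["B-ELEMENT", "O", "B-PROPERTY", "O", "O"]
precision, recall, f1 = prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so equal precision and recall of 0.84 yield an F1 of 0.84, consistent with the SF-NER row.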
Fig.3  Ten-fold cross-validation results of the SF-NER model on the A_DomainDictionary dataset
Model                        A_ManualLabeling  A_DomainDictionary
BERT-Bi-GRU-CRF              0.44              0.60
Bi-LSTM(Glove)-CRF           0.75              0.80
Bi-LSTM(OneHot-BPE)-CRF      0.78              0.84
Table 2  Performance (F1-score) of different models on the two datasets
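Fig.4  Precision, recall, and F1-score of the SF-NER model for different entity types
Fig.5  Recommended descriptors and entities already considered in materials machine learning (partial display)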