A New Methodology for Chinese Term Extraction from Scientific Publications
Keywords:
Automatic term extraction, Technical term extraction, Terminology extraction, Context information, Chinese term extraction
Abstract
This study extracts Chinese technical terms from a corpus of scientific publications. The process first identifies term boundaries and then applies Chinese part-of-speech (POS) patterns to extract candidate terms. Features of the words or characters that signal term boundaries are defined; these allow sentences to be segmented into smaller units and help remove irrelevant candidates that other approaches may fail to filter. The POS patterns are designed specifically for extracting Chinese technical terms. A comparison between candidates extracted with these POS patterns and candidates obtained with n-gram models shows that the POS-based method eliminates a significant portion of non-relevant candidates while retaining most useful ones. For term scoring, a novel context-based method, referred to as the Hellinger distance for context information acquisition, is introduced; it proves more effective than existing context-based methods. The Hellinger distance method is then integrated with Kullback–Leibler divergence to evaluate terms along the dimensions of informativeness and phraseness. The proposed term scoring method is compared with eight alternative approaches, and the results show that it outperforms them in scoring Chinese terms, particularly in the extraction of multi-word terms.
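The two measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of the Hellinger distance between two discrete distributions and the common pointwise Kullback–Leibler formulation of phraseness and informativeness; the abstract does not give the paper's exact formulas or how the scores are combined, so the function names, the choice of context distributions, and the toy counts below are all hypothetical.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Normalize raw counts into a probability distribution (dict)."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def pointwise_kl(p, q):
    """Pointwise KL contribution p * log(p / q); assumes q > 0 (e.g. smoothed)."""
    if p == 0.0:
        return 0.0
    return p * math.log(p / q)


def phraseness(p_phrase, word_probs):
    """Assumed formulation: phrase probability vs. product of its word probabilities."""
    return pointwise_kl(p_phrase, math.prod(word_probs))


def informativeness(p_foreground, p_background):
    """Assumed formulation: probability in the domain corpus vs. a general corpus."""
    return pointwise_kl(p_foreground, p_background)


if __name__ == "__main__":
    # Hypothetical counts of words seen next to a candidate term,
    # compared with hypothetical corpus-wide word counts.
    candidate_context = to_distribution(Counter({"提取": 4, "方法": 3, "的": 1}))
    corpus_wide = to_distribution(Counter({"的": 50, "方法": 10, "提取": 5, "是": 35}))
    print("context distance:", hellinger(candidate_context, corpus_wide))

    # Hypothetical probabilities for a two-word candidate.
    print("phraseness:", phraseness(0.002, [0.01, 0.02]))
    print("informativeness:", informativeness(0.002, 0.0001))
```

In this sketch a larger Hellinger distance means the candidate's context differs more from general usage, and larger pointwise KL values mean the candidate behaves more like a fixed phrase or is more specific to the domain corpus. Whether the paper uses the distances in this direction is not stated in the abstract.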
Copyright (c) 2025 Huaili Zheng, Ting Jiang

This work is licensed under a Creative Commons Attribution 4.0 International License.