A New Methodology for Chinese Term Extraction from Scientific Publications

Huaili Zheng; Ting Jiang

doi:10.61187/ita.v3i2.222

Authors

Huaili Zheng
zhenghuaili@nufe.edu.cn
School of Computer and Artificial Intelligence, Nanjing University of Finance & Economics, Nanjing, China
Ting Jiang
School of Computer and Artificial Intelligence, Nanjing University of Finance & Economics, Nanjing, China

Keywords

Automatic term extraction, Technical term extraction, Terminology extraction, Context information, Chinese term extraction

Abstract

To identify Chinese technical terms, this study focuses on extracting terms from a corpus of scientific publications. The process begins with the identification of term boundaries, followed by the application of Chinese part-of-speech (POS) patterns to extract candidate terms. Features of words or characters that signal term boundaries are defined, enabling the segmentation of sentences into smaller units and facilitating the removal of irrelevant terms that may not be filtered by other approaches. POS patterns are specifically designed for the extraction of Chinese technical terms. A comparison between candidate terms extracted using these POS patterns and those obtained via n-gram models shows that the proposed POS-based method effectively eliminates a significant portion of non-relevant terms while retaining most useful ones. In the term scoring phase, a novel method based on contextual information—referred to as the Hellinger distance for context information acquisition—is introduced. This approach proves more effective than existing context-based methods. Subsequently, the Hellinger distance method is integrated with Kullback–Leibler divergence to evaluate terms along the dimensions of informativeness and phraseness. The proposed term scoring method is compared with eight alternative approaches. Results demonstrate that it outperforms others in scoring Chinese terms, particularly in the extraction of multi-word terms.
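The two measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of the Hellinger distance and the pointwise Kullback–Leibler contribution. The abstract does not state how the paper builds its context distributions or combines the scores, so the helper names (`normalize`, `hellinger`, `pointwise_kl`), the choice to compare a candidate's context with a corpus-wide context, and all counts and probabilities below are assumptions made for illustration only.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def pointwise_kl(p, q):
    """Contribution of a single event to KL(P || Q): p * log(p / q)."""
    return p * math.log(p / q)


# Toy context-word counts, invented for this sketch (not data from the paper).
# Assumption: a candidate's context is compared with a corpus-wide context.
term_context = normalize(Counter({"算法": 6, "模型": 3, "的": 1}))
corpus_context = normalize(Counter({"算法": 1, "模型": 1, "的": 8}))
context_score = hellinger(term_context, corpus_context)

# Hypothetical probabilities for a two-word candidate.
p_term_fg = 0.010            # candidate as a unit in the domain corpus
p_parts_fg = 0.050 * 0.040   # product of its words' unigram probabilities
p_term_bg = 0.001            # candidate in a general background corpus

# Phraseness: the unit versus its independent parts.
phraseness = pointwise_kl(p_term_fg, p_parts_fg)
# Informativeness: the domain corpus versus the background corpus.
informativeness = pointwise_kl(p_term_fg, p_term_bg)

print(f"context={context_score:.3f} phraseness={phraseness:.4f} "
      f"informativeness={informativeness:.4f}")
```

In this sketch, a candidate whose context distribution differs strongly from the corpus-wide one receives a Hellinger score closer to 1, and a candidate that is more probable as a unit than as independent words, or more probable in the domain corpus than in the background corpus, receives positive phraseness or informativeness.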
