Automating Lexical Complexity Measurement in Chinese with WeCLECA

    Lexical complexity, generally understood as a multidimensional construct consisting of lexical density, sophistication, and diversity, has been recognized as an important construct in both first and second language acquisition. Researchers have proposed a large variety of lexical complexity measures to study its relationship to second language learners’ writing and/or speaking proficiency. While this line of research has generated fruitful results for English and other Indo-European languages, less is known about how it may apply to a typologically distant language such as Chinese. In this paper, we report the design of a computational tool that automates the measurement of lexical complexity with 25 indices. The Batch Mode of the software supports the analysis of a large number of .txt files and outputs the results to a .CSV file suitable for import into statistical packages for further analysis. In an example application, we analyzed 87 texts from three distinct registers (academic prose, fiction, and press reportage) from the Lancaster Corpus of Mandarin Chinese (McEnery and Xiao, The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study, in Proc. Fourth Int. Conf. Language Resources and Evaluation, eds. M. Lino, M. Xavier, F. Ferreira, R. Costa and R. Silva (European Language Resources Association, Paris, 2014), pp. 1175–1178) and explored how register variation may manifest in lexical complexity. The ANOVA analyses showed a linear increase in multiple lexical sophistication and diversity measures across academic prose, fiction, and press reportage. The results showed that press reportage has lower lexical density but higher lexical sophistication and diversity than academic prose; fiction also has higher lexical diversity than academic prose. The paper concludes with a discussion of potential pedagogical applications for Chinese language teaching and learning.
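    To make the batch workflow described in the abstract more concrete, the following is a minimal sketch, not the tool's actual implementation. It assumes the input has already been word-segmented and part-of-speech tagged in a `word/tag` format with ICTCLAS-style tags, and it computes only three illustrative indices (lexical density, type-token ratio, and root TTR) rather than the 25 indices of the tool. The function names, the choice of nouns, verbs, and adjectives as lexical words, and the CSV column names are all assumptions made for illustration.

```python
import csv
import io
import math

# Assumption: tags beginning with n (noun), v (verb), or a (adjective) count as
# lexical words. The tool's own definition of lexical words may differ.
LEXICAL_TAG_PREFIXES = ("n", "v", "a")


def parse_tagged(text):
    """Split 'word/tag word/tag ...' into (word, tag) pairs, skipping punctuation (tags starting with 'w')."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        if word and tag and not tag.startswith("w"):
            pairs.append((word, tag))
    return pairs


def lexical_indices(pairs):
    """Compute a few illustrative lexical complexity indices for one text."""
    n_tokens = len(pairs)
    if n_tokens == 0:
        return {"tokens": 0, "types": 0, "lexical_density": 0.0, "ttr": 0.0, "root_ttr": 0.0}
    n_types = len({word for word, _ in pairs})
    n_lexical = sum(1 for _, tag in pairs if tag.startswith(LEXICAL_TAG_PREFIXES))
    return {
        "tokens": n_tokens,
        "types": n_types,
        "lexical_density": n_lexical / n_tokens,   # lexical words / all words
        "ttr": n_types / n_tokens,                 # type-token ratio
        "root_ttr": n_types / math.sqrt(n_tokens), # types / sqrt(tokens)
    }


def batch_to_csv(texts, out):
    """Write one CSV row of indices per text; `texts` maps a file name to its tagged content."""
    fieldnames = ["file", "tokens", "types", "lexical_density", "ttr", "root_ttr"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for name, text in texts.items():
        writer.writerow({"file": name, **lexical_indices(parse_tagged(text))})


if __name__ == "__main__":
    sample = {"demo.txt": "我/r 喜欢/v 中文/n 。/w 我/r 学习/v 中文/n 。/w"}
    buffer = io.StringIO()
    batch_to_csv(sample, buffer)
    print(buffer.getvalue())
```

    In practice the dictionary of texts would be filled by reading a folder of .txt files, and the resulting CSV could be loaded into a statistical package to compare registers, for example with a one-way ANOVA per index.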


    • 1. K. Wolfe-Quintero, S. Inagaki and K.-Y. Kim, Second Language Development in Writing: Measures of Fluency, Accuracy and Complexity (Second Language Teaching and Curriculum Center, University of Hawaii, 1998).
    • 2. K. Hyltenstam, Lexical characteristics of near-native second-language learners of Swedish, J. Multiling. Multicult. Dev. 9 (1998) 67–84.
    • 3. B. Bulté and A. Housen, Defining and operationalising L2 complexity, in Dimensions of L2 Performance and Proficiency: Investigating Complexity, Accuracy and Fluency in SLA, eds. A. Housen, F. Kuiken and I. Vedder (John Benjamins, Amsterdam, 2012), pp. 21–46.
    • 4. S. A. Crossley, T. Salsbury, D. S. McNamara and S. Jarvis, Predicting lexical proficiency in language learner texts using computational indices, Lang. Test. 28 (2011) 561–580.
    • 5. A. Zareva, P. Schwanenflugel and Y. Nikolova, Relationship between lexical competence and language proficiency: Variable sensitivity, Stud. Second Lang. Acquis. 27 (2005) 567–595.
    • 6. K. Kyle and S. A. Crossley, Automatically assessing lexical sophistication: Indices, tools, findings, and application, TESOL Q. 49 (2014) 757–786.
    • 7. B. Laufer and P. Nation, Vocabulary size and use: Lexical richness in L2 written production, Appl. Linguist. 16 (1995) 307–322.
    • 8. X. Lu, The relationship of lexical richness to the quality of ESL learners’ oral narratives, Mod. Lang. J. 96 (2012) 190–208.
    • 9. G. Yu, Lexical diversity in writing and speaking task performances, Appl. Linguist. 31 (2000) 236–259.
    • 10. A. Vermeer, Coming to grips with lexical richness in spontaneous speech data, Lang. Test. 17 (2000) 65–83.
    • 11. A. C. Graesser, D. S. McNamara, M. M. Louwerse and Z. Cai, Coh-Metrix: Analysis of text on cohesion and language, Behav. Res. Meth. Instrum. Comput. 36 (2004) 193–202.
    • 12. D. S. McNamara, A. C. Graesser, P. M. McCarthy and Z. Cai, Automated Evaluation of Text and Discourse with Coh-Metrix (Cambridge University Press, England, 2014).
    • 13. W. Bo, J. Chen, K. Guo and T. Jin, Data-driven adapting for fine-tuning Chinese teaching materials: Using corpora as benchmarks, in Computational and Corpus Approaches to Chinese Language Learning, eds. X. Lu and B. Chen (Springer, Singapore, 2019), pp. 99–118.
    • 14. J.-H. Chen, X.-Q. Cai, B.-C. Guo, C.-H. Liao and Y.-M. Yang, Research on computer automated textual analysis and word categorization, in The E-Learning and Information Technology Symp. (EITS-2013) (Southern Taiwan University of Science and Technology, Taiwan, 2013).
    • 15. Y. Sung, T. Chang, W. Lin, K.-S. Hsieh and K.-E. Chang, CRIE: An automated analyzer for Chinese texts, Behav. Res. Meth. 48 (2016) 1238–1251.
    • 16. T. McEnery and R. Xiao, The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study, in Proc. Fourth Int. Conf. Language Resources and Evaluation, eds. M. Lino, M. Xavier, F. Ferreira, R. Costa and R. Silva (European Language Resources Association, Paris, France, 2014), pp. 1175–1178.
    • 17. J. Ure, Lexical density: A computational technique and some findings, in Talking About Text, ed. M. Coultard (University of Birmingham, Birmingham, England, 1971), pp. 27–48.
    • 18. M. A. K. Halliday, Spoken and Written Language (Deakin University Press, Melbourne, Australia, 1985).
    • 19. K. O’Loughlin, Lexical density in candidate output of direct and semi-direct versions of an oral proficiency test, Lang. Test. 12 (1995) 217–237.
    • 20. J. Read, Assessing Vocabulary (Oxford University Press, Oxford, England, 2000).
    • 21. M. Linnarud, Lexis in Composition: A Performance Analysis of Swedish Learners’ Written English (CWK Gleerup, Lund, Sweden, 1986).
    • 22. B. Laufer, The lexical profile of second language writing: Does it change over time? RELC J. 25 (1994) 21–33.
    • 23. G. Xue and P. Nation, A university word list, Lang. Learn. Commun. 3 (1984) 215–229.
    • 24. B. Harley and M. L. King, Verb lexis in the written compositions of young L2 learners, Stud. Second Lang. Acquis. 11 (1989) 415–440.
    • 25. C. Chaudron and K. Parker, Discourse markedness and structural markedness: The acquisition of English noun phrases, Stud. Second Lang. Acquis. 12 (1990) 43–64.
    • 26. J. B. Carroll, Language and Thought (Prentice-Hall, Englewood Cliffs, NJ, 1964).
    • 27. D. Malvern, B. Richards, N. Chipere and P. Duran, Lexical Diversity and Language Development: Quantification and Assessment (Palgrave MacMillan, Houndmills, UK, 2004).
    • 28. T. Klee, Developmental and diagnostic characteristics of quantitative measures of children’s language production, Top. Lang. Disord. 12 (1992) 28–41.
    • 29. J. F. Miller, Quantifying productive language disorders, in Research in Child Language Disorders: A Decade of Progress, ed. J. F. Miller (Pro-Ed, Austin, TX, 1991), pp. 211–220.
    • 30. E. Thordardottir and S. Ellis Weismer, High frequency verbs and verb diversity in the spontaneous speech of school-age children with specific language impairment, Int. J. Lang. Commun. Disord. 36 (2011) 221–244.
    • 31. M. Templin, Certain Language Skills in Children: Their Development and Interrelationships (The University of Minnesota Press, Minneapolis, 1957).
    • 32. P. J. L. Arnaud, Objective lexical and grammatical characteristics of L2 written compositions and the validity of separate-component tests, in Vocabulary and Applied Linguistics, eds. P. J. L. Arnaud and H. Béjoint (MacMillan, London, England, 1992), pp. 133–145.
    • 33. W. Johnson, Studies in language behavior: I. A program of research, Psychol. Monogr. 56 (1944) 1–15.
    • 34. P. Guiraud, Problèmes et méthodes de la statistique linguistique [Problems and methods of statistical linguistics] (D. Reidel, Dordrecht, The Netherlands, 1960).
    • 35. G. Herdan, Quantitative Linguistics (Butterworths, London, England, 1964).
    • 36. D. Dugast, Vocabulaire et stylistique. I Théâtre et dialogue [Vocabulary and style. Vol. 1 Theatre and dialogue] (Slatkine-Champion, Geneva, Switzerland, 1979).
    • 37. H. Daller, R. Van Hout and J. Treffers-Daller, Lexical richness in the spontaneous speech of bilinguals, Appl. Linguist. 24 (2003) 197–222.
    • 38. C. P. Casanave, Language development in students’ journals, J. Second Lang. Writ. 3 (1994) 179–201.
    • 39. C. A. Engber, The relationship of lexical proficiency to the quality of ESL compositions, J. Second Lang. Writ. 4 (1995) 139–155.
    • 40. E. McClure, A comparison of lexical strategies in L1 and L2 written English narratives, Pragmat. Lang. Learn. 2 (1991) 141–154.
    • 41. D. Graddol, The future of language, Science 303 (2004) 1329–1331.
    • 42. M. P. Lewis, Ethnologue: Languages of the World, 16th edn. (SIL International, Dallas, TX, 2009).
    • 43. H. Li, A. C. Graesser, M. Conley, Z. Cai, P. I. Pavlik and J. W. Pennebaker, A new measure of text formality: An analysis of discourse of Mao Zedong, Discourse Process. 53 (2016) 205–232.
    • 44. B.-B. Song, X.-B. Zhou and T. Jin, A study on the coverage and of words exceeding the standards of vocabulary, Chin. Lang. Learn. 3 (2017) 95–104 (in Chinese).
    • 45. Y. Li, X. Zhou, Y. Zhou, F. Mao, S. Shen, Y. Lin, X. Zhang, T.-H. Chang and Q. Sun, Evaluation of the quality and readability of online information about breast cancer in China, Patient Educ. Couns. 104 (2021) 858–864.
    • 46. Q. Liu, H.-P. Zhang, H.-K. Yu and Y.-Q. Cheng, Chinese lexical analysis using cascaded hidden Markov model, J. Comput. Res. Dev. 41 (2004) 1421–1429.
    • 47. H.-P. Zhang, H.-K. Yu, D.-Y. Xiong and Q. Liu, HHMM-based Chinese lexical analyzer ICTCLAS, in Proc. Second SIGHAN Workshop on Chinese Language Processing, Vol. 17 (Association for Computational Linguistics, 2003), pp. 184–187.
    • 48. D. Biber, University Language: A Corpus-based Study of Spoken and Written Registers (John Benjamins, Amsterdam, 2006).
    • 49. C. A. Ferguson, Sports announcer talk: Syntactic aspects of register variation, Lang. Soc. 12 (1983) 153–172.
    • 50. D. Biber, S. Johansson, G. Leech, S. Conrad and E. Finegan, Longman Grammar of Spoken and Written English (Longman, London, England, 1999).
    • 51. S. Conrad and D. Biber, The frequency and use of lexical bundles in conversation and academic prose, Lexicographica 20 (2004) 56–71.
    • 52. M. Zhang, A multidimensional analysis of metadiscourse markers across written registers, Discourse Stud. 18 (2016) 204–222.
    • 53. J. Zhao, Duiwai hanyu jiaoxue gailun (Introduction to Teaching Chinese as a Foreign Language) (Xinxuelin, Taipei, 2009) (in Chinese).