The Application of the Comparative Judgment in Chinese Text Difficulty Assessment

doi:10.12139/j.1672-0628.2026.02.002

Abstract


Comparative judgment is a relatively holistic method that can efficiently yield reliable assessments of text difficulty, but its effectiveness for Chinese texts has yet to be explored. In the current study, 80 evaluators made pairwise comparisons of 80 texts, and the effect of the number of comparisons on the reliability and validity of comparative judgment was examined. The scores obtained from comparative judgment had high reliability and were significantly correlated with textbook volume number and readability scores. As the number of comparisons increased, reliability and validity increased and then stabilized. No effect of evaluator characteristics on the comparative judgment results was found. These findings suggest that comparative judgment is reasonably reliable for assessing the difficulty of Chinese texts.

Keywords: Chinese leveled reading, text difficulty, comparative judgment, elementary school Chinese textbooks


Guandoudou YANG, Jingwen TAN, Miaomiao LIU, Hong LI. The Application of the Comparative Judgment in Chinese Text Difficulty Assessment[J]. Studies of Psychology and Behavior, 2026, 24(2): 151-160.

杨官豆豆, 谭静文, 刘苗苗, 李虹. 两两比较在汉语文本难度评估中的应用[J]. 心理与行为研究, 2026, 24(2): 151-160.



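Grouping variable	Group	n	Relative reading habits score	Infit	t/Z	p
Major	Science	29	21.45(3.26)	0.90(0.31)	-0.31	0.76
Major	Humanities	51	21.31(4.94)	0.93(0.40)	-0.31	0.76
College entrance examination (Gaokao) Chinese score	119 or below	38	21.29(5.29)	0.98(0.41)	-1.40	0.16
College entrance examination (Gaokao) Chinese score	120 or above	42	21.43(3.44)	0.87(0.32)	-1.40	0.16
Relative reading habits	≤21	35	17.54(2.95)	0.96(0.38)	-0.94	0.35
Relative reading habits	>21	45	24.33(2.67)	0.89(0.36)	-0.94	0.35
Total		80	21.36(4.38)	0.92(0.37)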

Volume	Number of texts	Characters	Character types	Words	Word types	Sentences	Sentence length	Readability score (random forest)	Readability score (support vector machine)
1	10	70.20(36.67)	35.30(16.28)	52.40(29.27)	29.10(14.42)	4.30(2.00)	18.03(8.66)	1.24(0.13)	1.67(0.27)
2	10	115.90(55.67)	60.00(14.71)	87.30(44.49)	50.60(14.28)	7.20(4.21)	17.56(4.55)	2.04(0.22)	2.30(0.17)
3	10	163.60(48.80)	88.50(16.35)	118.30(39.34)	73.80(16.35)	9.40(3.60)	19.03(6.80)	2.97(0.17)	3.33(0.24)
4	10	237.10(61.03)	106.10(23.85)	174.80(45.47)	91.20(21.52)	13.90(5.26)	18.04(3.98)	3.86(0.15)	4.18(0.13)
5	10	361.90(114.09)	158.50(36.24)	253.10(87.20)	138.40(39.39)	19.40(7.63)	19.46(3.42)	5.06(0.25)	5.11(0.10)
6	10	434.40(54.73)	199.40(21.19)	300.40(45.40)	169.50(22.85)	17.30(5.33)	26.56(6.12)	6.22(0.16)	6.20(0.17)
7	10	513.20(66.14)	241.30(17.52)	333.80(51.20)	203.50(21.68)	21.50(4.77)	24.48(3.62)	7.23(0.15)	7.25(0.10)
8	10	716.10(175.72)	303.80(37.35)	458.40(127.33)	256.90(44.58)	24.60(8.07)	30.05(5.06)	8.36(0.24)	8.00(0.25)
Total	80	326.55(224.95)	149.11(91.24)	222.31(145.33)	126.63(78.80)	14.70(8.56)	21.65(6.92)	4.62(2.39)	4.75(2.17)

Volume	Number of texts	Comparative judgment score (R=20)	Comparative judgment score (R=40)	Comparative judgment score (R=60)	Comparative judgment score (R=80)
1	10	2.99^a(0.87)	2.88^a(1.03)	3.27^a(1.11)	3.66^a(1.03)
2	10	0.85^b(1.19)	0.67^b(1.06)	1.50^b(1.47)	2.19^ab(1.08)
3	10	0.65^bc(1.68)	0.48^bc(1.13)	1.27^b(0.91)	1.48^b(1.05)
4	10	-0.73^bc(1.50)	-0.08^bc(1.13)	-0.05^bc(1.32)	-0.18^c(1.01)
5	10	-0.59^bc(1.39)	0.24^bc(1.17)	-0.42^c(1.11)	-0.76^cd(1.07)
6	10	-0.40^bc(1.39)	-0.92^cde(1.06)	-1.26^cd(0.88)	-1.17^cd(0.99)
7	10	-1.67^c(1.41)	-1.29^de(0.57)	-1.65^cd(1.11)	-1.80^d(1.41)
8	10	-1.10^bc(1.95)	-1.99^e(1.24)	-2.67^d(1.46)	-3.41^e(1.00)
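
This page does not state how the comparative judgment scores were estimated. As a minimal sketch only, assuming a Bradley-Terry model fitted by the standard minorization-maximization updates, the Python below shows one way pairwise decisions could be turned into a score per text. The function name `fit_bradley_terry`, its parameters, and the toy data are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def fit_bradley_terry(n_items, comparisons, n_iter=500, tol=1e-10):
    """Return one log-strength per item from (chosen, not_chosen) index pairs.

    Assumption for this sketch: every item is chosen at least once and
    passed over at least once; otherwise its estimate would be infinite.
    """
    wins = [0.0] * n_items
    losses = [0.0] * n_items
    pair_counts = defaultdict(float)  # (i, j) with i < j -> times compared
    for chosen, other in comparisons:
        wins[chosen] += 1.0
        losses[other] += 1.0
        pair_counts[(min(chosen, other), max(chosen, other))] += 1.0

    if any(w == 0 or l == 0 for w, l in zip(wins, losses)):
        raise ValueError("each item needs at least one win and one loss")

    strength = [1.0] * n_items
    for _ in range(n_iter):
        # Denominator of the MM update: sum over opponents of n_ij / (p_i + p_j).
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        updated = [wins[i] / denom[i] for i in range(n_items)]

        # Rescale so the log-strengths sum to zero (fixes the scale origin).
        log_mean = sum(math.log(s) for s in updated) / n_items
        updated = [s / math.exp(log_mean) for s in updated]

        change = max(abs(a - b) for a, b in zip(updated, strength))
        strength = updated
        if change < tol:
            break

    return [math.log(s) for s in strength]


if __name__ == "__main__":
    # Toy data: three texts, each pair compared four times.
    toy = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 3 + [(2, 0)]
    print(fit_bradley_terry(3, toy))
```

The estimates are centered at zero, so only differences between texts are meaningful. Which text counts as "chosen" (the easier or the harder one) sets the direction of the scale and has to be fixed by the judging instructions.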