The Application of the Comparative Judgment in Chinese Text Difficulty Assessment

Studies of Psychology and Behavior, 2026, Vol. 24, Issue 2: 151-160. DOI: 10.12139/j.1672-0628.2026.02.002

YANG Guandoudou (杨官豆豆)^1, TAN Jingwen (谭静文)^1,2, LIU Miaomiao (刘苗苗)^1,3, LI Hong (李虹)^1

1. Beijing Key Laboratory of Applied Experimental Psychology, National Demonstration Center for Experimental Psychology Education (Beijing Normal University), Institute of Children's Reading and Learning, Faculty of Psychology, Beijing Normal University, Beijing 100875;
2. Shenzhen Hongling Primary School, Shenzhen 518000;
3. Faculty of Education, Henan Normal University, Xinxiang 453007

Received: 2025-03-18 Online: 2026-03-20 Published: 2026-03-20
Corresponding author: LI Hong
Funding: State Language Commission "14th Five-Year Plan" research project (WT45-41); National Social Science Fund of China, General Project in Education, "AI-Empowered Individualized Teaching: Agent-Based Multi-Level Assessment and Individualized Intervention System for Learning Difficulties" (BBA250057).

Abstract: Comparative judgment is a relatively holistic method of assessing text difficulty that can yield reliable results efficiently, but its effectiveness for Chinese texts remains to be explored. In the current study, 80 evaluators made pairwise comparisons of 80 texts, and we examined how the number of comparisons affects the reliability and validity of comparative judgment. The difficulty estimates obtained from comparative judgment showed high reliability and were significantly correlated with both the textbook volume number of each text and its readability score. As the number of comparisons increased, reliability and validity rose gradually and then stabilized. No effect of evaluator characteristics on the comparison results was found. These findings suggest that comparative judgment is reasonably reliable for assessing the difficulty of Chinese texts.

Key words: Chinese leveled reading, text difficulty, comparative judgment, elementary-school-level Chinese language and literature textbooks
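
As an illustration of how pairwise judgments can be turned into a difficulty scale, the following is a minimal sketch that fits a Bradley-Terry model with the classic MM iteration. The abstract does not state which scaling model the authors used, so the choice of Bradley-Terry here is an assumption, and the function names (`fit_bradley_terry`, `prob_harder`) and the toy judgments are hypothetical and not taken from the study.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, n_items, n_iter=500, tol=1e-10):
    """Fit a Bradley-Terry model with the MM (Zermelo) iteration.

    comparisons: iterable of (harder, easier) index pairs, one per judgment.
    Returns log-scale difficulty estimates centered at zero.
    """
    wins = [0] * n_items
    losses = [0] * n_items
    pair_counts = defaultdict(int)  # unordered pair -> number of judgments
    for harder, easier in comparisons:
        wins[harder] += 1
        losses[easier] += 1
        pair_counts[(min(harder, easier), max(harder, easier))] += 1

    # Necessary (not sufficient) condition for finite estimates.
    if any(w == 0 or l == 0 for w, l in zip(wins, losses)):
        raise ValueError("every text must win and lose at least once")

    strength = [1.0] * n_items
    for _ in range(n_iter):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        updated = [wins[i] / denom[i] for i in range(n_items)]
        # Rescale so the geometric mean is 1 (the scale origin is arbitrary).
        log_mean = sum(math.log(s) for s in updated) / n_items
        updated = [s / math.exp(log_mean) for s in updated]
        change = max(abs(math.log(u) - math.log(s))
                     for u, s in zip(updated, strength))
        strength = updated
        if change < tol:
            break
    return [math.log(s) for s in strength]


def prob_harder(theta, i, j):
    """Model probability that text i is judged harder than text j."""
    return 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))


# Toy data (hypothetical): three texts, ten judgments, first index = judged harder.
judgments = [(1, 0), (1, 0), (0, 1), (2, 1), (2, 1), (1, 2),
             (2, 0), (2, 0), (2, 0), (0, 2)]
theta = fit_bradley_terry(judgments, n_items=3)
ranking = sorted(range(3), key=lambda k: theta[k])  # easiest to hardest
print(ranking, [round(t, 3) for t in theta])
```

In a design like the one described, each of the 80 texts would receive one estimate, and those estimates could then be correlated with external indicators such as volume number or readability score. The guard against texts that never win or never lose is only a rough check; a full analysis would also need the comparison graph to be connected.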

CLC number: B842

YANG Guandoudou, TAN Jingwen, LIU Miaomiao, LI Hong. The Application of the Comparative Judgment in Chinese Text Difficulty Assessment[J]. Studies of Psychology and Behavior, 2026, 24(2): 151-160.
