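Cite this article: LIAO Wei-zhen, HAN You-li, MA Cheng-yu. Evaluation paradigms for conversational AI in healthcare: Systematic review[J]. Chinese Journal of Health Policy (中国卫生政策研究), 2025, 18(7): 78-86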
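Evaluation paradigms for conversational AI in healthcare: Systematic review
Received: 2025-03-31  Revised: 2025-05-22
LIAO Wei-zhen, HAN You-li, MA Cheng-yu
School of Public Health, Capital Medical University, Beijing 100069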
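Abstract: Objective To systematically review the evaluation paradigms for conversational artificial intelligence (AI) in healthcare, and to provide a reference for building evaluation systems for medical AI and improving evaluation methods. Methods A systematic review was used to analyze the evaluation paradigms for healthcare conversational AI, including evaluation subjects, evaluation metrics, and evaluation methods. Results A total of 60 studies were included. Evaluation subjects were mainly general-purpose large models. Evaluation metrics covered five dimensions: technical performance, information quality, clinical effectiveness, user experience, and ethics and safety. The metrics used across existing studies varied considerably. Problems included a poor match between test questions and application scenarios and a narrow range of evaluator roles. Conclusions The current evaluation system for healthcare conversational AI is still incomplete. Future work should improve the evaluation paradigm in terms of the coverage of model types, the comprehensiveness of the metric system, the standardization of evaluation methods, the operability of test content, and the extensibility of evaluation languages.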
Key words: Large language model; Conversational AI; Evaluation metrics; Evaluation system; Healthcare
Funding: National Social Science Fund of China (24BGL273)
|
Evaluation paradigms for conversational AI in healthcare: Systematic review
LIAO Wei-zhen, HAN You-li, MA Cheng-yu
School of Public Health, Capital Medical University, Beijing 100069, China
Abstract: Objective To systematically review current evaluation paradigms for conversational AI in healthcare and to inform the development of a comprehensive evaluation framework and improved evaluation methods. Methods A systematic review was conducted by searching the PubMed and Web of Science databases to analyze existing evaluation paradigms for healthcare conversational AI, including evaluation subjects, assessment metrics, and evaluation methodologies. Results A total of 60 studies were included. Most studies evaluated general-purpose large language models. The assessment metrics covered five dimensions: technical performance, information quality, clinical effectiveness, user experience, and ethics and safety. However, the metrics used differed substantially across studies. Other problems included poor alignment between the test questions and the intended application scenarios, and a lack of diversity in evaluator roles. Conclusions The current evaluation framework for healthcare conversational AI remains underdeveloped. Future improvements should focus on broadening the range of model types covered, making the evaluation indicators more comprehensive, standardizing evaluation methods, improving the operability of test content, and extending evaluation to additional languages.
Key words: Large language model; Conversational AI; Evaluation index; Evaluation system; Healthcare
|
|