Background
Artificial intelligence in the form of chatbots is an emerging reality. Equipped with vision-language capabilities, models like GPT-4o or PathChat can process both images and text, opening up the possibility of multimodal virtual assistants in healthcare. While there is an urgent need for, and a growing expectation of, digital assistants that support clinical diagnostics, clinicians and researchers must critically examine how well chatbots can answer diagnostic questions.
Aim
The aim of DALPHIN (DigitAL PatHology assIstant beNchmark) is to create a multicentric open benchmark for virtual assistants applied to diagnostic problems in digital pathology. Pathologists from multiple clinical centers will provide cases, each consisting of histopathology regions of interest (ROIs), questions, and answers, across various pathology subspecialties. We will assess the performance of both general-purpose and pathology-specific chatbots on our benchmark and compare it with the performance of pathologists with different levels of expertise. Ultimately, we plan to publicly release the benchmark on the Grand-Challenge platform, where submissions will be evaluated automatically and ranked on a leaderboard.