Computational Pathology
Computational Pathology, in a nutshell, is the application of machine learning and data science methods to digital pathology data. This data may include image data (e.g., scanned histological slides, also known as whole slide images (WSI)) as well as text data (e.g., diagnostic reports).
A prerequisite for Computational Pathology is the digital availability of such data. The process of digitizing pathology data requires both hardware and software tools, collectively referred to as Digital Pathology.

Computational Pathology Heidelberg has been responsible for developing Digital Pathology pipelines since 2023. Routine diagnostics and research applications are addressed separately:
- In routine clinical practice, the goal is to digitize cases to apply computational tools for well-defined use cases where a clear clinical benefit can be anticipated.
- In scientific research, the goal is to establish a flexible pipeline that can apply machine learning methods to image data, text data, or their combination. This research pipeline also serves as a testbed for future tools intended for clinical deployment.

Fig. 1. Example of an anonymized slide with tissue
If you are interested in a collaboration or in joining us for a research project, such as an internship or a medical doctoral or master's thesis, feel free to contact pathodigital (at) med.uni-heidelberg.de
Scientific Projects
Within the above-described framework, our scientific projects, while self-contained, integrate seamlessly into our established computational pathology pipelines. To further assist physicians, the following projects are currently being researched:
Slide Naming and Metadata Extraction: The BabelFish and BabelShark Projects
Retrieving the names and metadata of scanned pathology slides is a non-trivial challenge due to the wide variability in how essential information—such as case IDs or staining types—is labeled or positioned. To address this, we have developed an OCR-based tool capable of extracting such information and storing it in a structured JSON format. This tool is named BabelFish, inspired by The Hitchhiker’s Guide to the Galaxy.
We are currently extending this work under the project name BabelShark, with the aim of improving robustness and scalability.
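
As a rough illustration of the final step only (turning already-recognized label text into a structured JSON record), a minimal sketch could look as follows. The case-ID pattern, the stain list, the field names, and both function names are hypothetical and are not taken from BabelFish; the OCR step itself is assumed to have run beforehand.

```python
import json
import re

# Hypothetical conventions, chosen only for illustration:
# a case ID such as "H 23/01234" and a short list of stain codes.
CASE_ID_PATTERN = re.compile(r"\b([A-Z])[-/ ]?(\d{2})[-/ ](\d{3,6})\b")
KNOWN_STAINS = ("HE", "PAS", "EVG", "CD3", "KI67")


def parse_label_text(ocr_text):
    """Extract a case ID and a staining type from the OCR text of a slide label."""
    text = ocr_text.upper()
    case_id = None
    match = CASE_ID_PATTERN.search(text)
    if match:
        prefix, year, number = match.groups()
        case_id = f"{prefix}{year}-{number}"  # normalize the separators
    # Compare whole tokens so that "HE" does not match inside another word.
    tokens = set(re.findall(r"[A-Z0-9]+", text))
    stain = next((s for s in KNOWN_STAINS if s in tokens), None)
    return {"case_id": case_id, "stain": stain, "raw_text": ocr_text}


def to_json(record):
    """Serialize one label record as human-readable JSON."""
    return json.dumps(record, indent=2, ensure_ascii=False)
```

Keeping the raw OCR text in the record makes it possible to re-parse labels later if the naming conventions change.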

Case Retrieval Based on Pathology Reports: The ReportHarvester
A common starting point for many tissue-based research projects is the identification of relevant cases through pathology reports. These reports, typically written in free text, use diverse terminologies to describe similar findings. While keyword-based searches can yield preliminary candidate cases, human review is still necessary to confirm their relevance.
To streamline this process, we are developing a user-friendly and powerful clinical text search tool, called DiagnosticReportHarvester, designed to better identify and prioritize reports of interest, thus reducing the manual workload.
For this, we utilize Elasticsearch for efficient indexing and database management, ensuring compatibility with existing SQL-based clinical databases. To enable semantic search and topic modeling, we also leverage clinical large language models (LLMs), allowing for improved contextual understanding of medical texts. To further refine text search queries, we automatically expand user-provided search terms with synonyms using the Unified Medical Language System (UMLS). Additionally, we employ medspaCy NLP tools to detect modified entities (e.g., negations) and diagnostic chapters relevant to pathology reports. The proposed system is designed as a secure and self-hostable web application, ensuring ease of deployment and accessibility for clinical users.
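
The synonym expansion described above can be sketched as follows. This is a minimal example under stated assumptions: the small synonym table stands in for a real UMLS lookup, the field name "report_text" is hypothetical, and the resulting dictionary merely follows the shape of an Elasticsearch bool query with match_phrase clauses; it is not sent to a server here.

```python
# Hypothetical synonym table standing in for a UMLS lookup.
SYNONYMS = {
    "glomerulonephritis": ["glomerular nephritis", "GN"],
    "carcinoma": ["malignant epithelial tumor"],
}


def expand_terms(term, synonyms):
    """Return the search term followed by its synonyms, without duplicates."""
    expanded = [term]
    for candidate in synonyms.get(term.lower(), []):
        if candidate.lower() not in [t.lower() for t in expanded]:
            expanded.append(candidate)
    return expanded


def build_query(term, field="report_text"):
    """Build a bool query that matches any of the expanded phrases."""
    phrases = expand_terms(term, SYNONYMS)
    return {
        "query": {
            "bool": {
                # One phrase clause per synonym; at least one must match.
                "should": [{"match_phrase": {field: p}} for p in phrases],
                "minimum_should_match": 1,
            }
        }
    }
```

A report mentioning any one of the expanded phrases would then be returned as a candidate, which still leaves negation handling and relevance ranking to later steps.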

Reconstructing Histological Puzzles
During routine histological processing, the submitted tissue specimens are dissected into smaller fragments, each of which is processed and analyzed separately. This process is analogous to voxel-based segmentation in medical imaging, such as in radiology. However, unlike radiology, the fragments are not routinely reassembled into representations of the original anatomical structure. This lack of reconstruction poses challenges for tasks such as assessing resection margins or correlating histological findings with imaging data.
To address this, we are developing a tool designed to reconstruct the original spatial relationships of histological fragments, thereby facilitating a more comprehensive analysis.

Correlating Text and Images in Histopathology
Histopathological diagnosis is deeply rooted in visual interpretation, which relies on both descriptive terminology and the application of conceptual frameworks. For instance, a diagnosis involving a “circumscribed follicular lesion with a fibrous capsule and no vascular invasion” blends objective observations with learned interpretive concepts.
Traditionally, such knowledge is acquired through textbooks, lectures, and expert annotations. In this project, we aim to develop image-to-text machine learning models trained on such multimodal sources to bridge the gap between visual features and diagnostic language.
LuFi (Lung Fibrosis)
This project focuses on AI-assisted image analysis using Multiple Instance Learning (MIL) to enhance the pathological diagnosis of interstitial lung diseases. MIL enables the model to learn from whole-slide images without requiring detailed annotations, making it suitable for large-scale histopathological data analysis. A key objective is to identify novel diagnostically relevant histomorphological patterns that may serve as new biomarkers. By uncovering these patterns, LuFi aims to support more accurate, consistent, and data-driven diagnostic decisions in pathology.
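
To illustrate how MIL can work with slide-level labels only, the following sketch shows attention-based pooling, one common way to aggregate tile features into a single slide representation. Which MIL variant LuFi actually uses is not stated here, so this is an assumption; the function names are hypothetical, and the parameters v and w, which would normally be learned, are passed in as fixed values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, v, w):
    """Aggregate tile embeddings (one bag = one slide) into one vector.

    instances: list of tile feature vectors
    v: projection matrix given as a list of rows
    w: scoring vector with one entry per row of v
    Each tile h receives the score w . tanh(V h); the attention weights
    are the softmax of these scores over all tiles of the slide.
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(a * b for a, b in zip(row, h))) for row in v]
        scores.append(sum(a * b for a, b in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return pooled, weights
```

The attention weights indicate which tiles contributed most to the slide-level representation, which is one way such a model can point to potentially relevant histomorphological patterns.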

WSI Quality Control
Optimizing HistoQC for Routine Digital Pathology Workflow
Accurate quality control is vital in digital pathology. HistoQC, an open‑source tool for assessing whole slide images, uses predefined thresholds that may not suit every lab, staining protocol, or scanner. This project aims to optimize and customize HistoQC’s thresholds by analyzing a diverse set of pathology slides and refining its settings for better performance.
Through this process, we will make HistoQC more adaptive, precise, and robust — reducing errors, supporting high‑throughput clinical and research applications, and ultimately standardizing quality control across institutions.
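
One simple way to tune a single threshold is to choose the cutoff at which automatic flagging agrees best with manual review. The sketch below assumes one numeric quality metric per slide (for example a blur score, where higher means worse) and a reviewer's accept/reject label. It does not use HistoQC's configuration format or API, and the function name and data layout are hypothetical.

```python
def best_threshold(scores, labels):
    """Pick the cutoff on one QC metric that best matches manual review.

    scores: one metric value per slide (higher = worse quality)
    labels: True if a reviewer rejected the slide, otherwise False
    A slide is flagged automatically when score >= threshold.
    Returns the best cutoff and its agreement with the manual labels.
    """
    best_cut, best_acc = None, -1.0
    # Only observed values need to be tried as candidate cutoffs.
    for cut in sorted(set(scores)):
        flagged = [s >= cut for s in scores]
        correct = sum(f == l for f, l in zip(flagged, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc
```

In practice, such a cutoff would be chosen separately per scanner and staining protocol and checked on slides that were not used for tuning.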

Software
- Code development with Git
- Python Projects
- Open-source libraries (e.g., PyTorch) and GPU toolkits (e.g., CUDA)
Depending on the specific research project, additional software stacks are required for OCR, retrieval-augmented generation (RAG) with LLMs, or WSI analysis.
Team
Head
PD Dr. med. Cleo-Aron Weis, M.Sc.
Research scientists
Shahram Aliyari, M.Sc.
Christoph Blattgerste, M.Sc.
Dr. med. Nils Englert, M.Sc.
Ayk Jessen, M.Sc.
Maximilian Legnar, M.Sc.
Balamurugan Thirukonda S. B., M.Sc.
System administrators
Ayk Jessen, M.Sc.
Maximilian Legnar, M.Sc.
Balamurugan Thirukonda S. B., M.Sc.
Operations manager
Balamurugan Thirukonda S. B., M.Sc.
Publications
- Blattgerste, C., Weis, C.-A., Hesser, J. (2024). Extending the margin of pathological tissue using a computer vision pipeline. In: Fresquet, X. and Hesser, J. 1st Joint DFH/UFA workshop on AI in Medicine: Optimised Trials with Machine Learning. Heidelberg University Library. pp. 4-8. https://doi.org/10.11588/heidok.00035481
- Englert N*, Schwab C*, Legnar M, Weis C-A. Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool. J Pathol Inform. 2024;15:100402. https://doi.org/10.1016/j.jpi.2024.100402
*both authors contributed equally
- Rusche D, Englert N, Runz M, Hetjens S, Langner C, Gaiser T, Weis C-A. Unraveling a Histopathological Needle-in-Haystack Problem: Exploring the Challenges of Detecting Tumor Budding in Colorectal Carcinoma Histology. Appl. Sci. 2024; 14(2):949. https://doi.org/10.3390/app14020949
- Legnar, M.; Siemoneit, J.-H.H.; Vandewiele, G.; Hesser, J.; Popovic, Z.; Porubsky, S.; Weis, C.-A. Investigating and Optimizing MINDWALC Node Classification to Extract Interpretable Decision Trees from Knowledge Graphs. Mach. Learn. Knowl. Extr. 2025, 7, 16. https://doi.org/10.3390/make7010016
- Legnar, M.; Daumke, P.; Hesser, J.; Porubsky, S.; Popovic, Z.; Bindzus, J.N.; Siemoneit, J.-H.H.; Weis, C.-A. Natural Language Processing in Diagnostic Texts from Nephropathology. Diagnostics 2022, 12, 1726. https://doi.org/10.3390/diagnostics12071726