Benchmarking Deep Learning-Based Image Retrieval of Oral Tumor Histology

Introduction: Oral tumors necessitate a dependable computer-assisted pathological diagnosis system considering their rarity and diversity. A content-based image retrieval (CBIR) system using deep neural networks has been successfully devised for digital pathology. No CBIR system for oral pathology has been investigated because of the lack of an extensive image database and feature extractors tailored to oral pathology. Materials and methods: This study uses a large CBIR database constructed from 30 categories of oral tumors to compare deep learning methods as feature extractors. Results: The highest average area under the receiver operating characteristic curve (AUC) was achieved by models trained on database images using self-supervised learning (SSL) methods (0.900 with SimCLR and 0.897 with TiCo). The generalizability of the models was validated using query images from the same cases taken with smartphones. When smartphone images were tested as queries, both models yielded the highest mean AUC (0.871 with SimCLR and 0.857 with TiCo). We ensured the retrieved image result would be easily observed by evaluating the top 10 mean accuracies and checking for an exact diagnostic category and its differential diagnostic categories. Conclusion: Training deep learning models with SSL methods using image data specific to the target site is beneficial for CBIR tasks in oral tumor histology to obtain histologically meaningful results and high performance. This result provides insight into the effective development of a CBIR system to help improve the accuracy and speed of histopathology diagnosis and advance oral tumor research in the future.


Introduction
Oral tumors are generally composed of diverse and rare tumor types, except for major categories like squamous cell carcinoma.Distinguishing oral tumor types is difficult except for well-experienced oral pathologists.The rarity of oral tumors and the diverse tissue types in the oral region make obtaining reference images for diagnosis and research a challenge, potentially leading to delayed diagnosis and a significant burden on pathologists [1].Consequently, a diagnostic system is needed to improve the speed and accuracy of histopathological diagnosis of these tumors [2].Artificial intelligence (AI) is a promising solution for efficient histopathological diagnosis of oral tumors.AI development for oral tumor diagnosis is limited and focused only on a few tumor types.Classification methods have been developed to predict the diagnosis, such as differentiating between ameloblastoma or odontogenic keratocysts, to which a histopathological image may belong [3,4].These approaches are helpful in common cases.However, a computer-aided diagnostic system, specifically for histological images, which covers a broader spectrum of tumor types would be more practical and help narrow the differential diagnoses.Therefore, content-based image retrieval (CBIR) is suitable.CBIR is a method of obtaining images that are relevant to a query image from a large collection of images based on their visual content.CBIR regards histopathological images as query images to find similar images from a database based on their similar morphology [2,5].This system is useful as a diagnostic aid for finding case references, especially where diagnostic expertise is challenging to find, such as in low-to middle-income countries [1].The involvement of human intervention is crucial in diagnosis.Conventionally, pathologists diagnose directly after hematoxylin and eosin (H&E)-stained slide analysis or optionally use different methods as diagnostic aids: referring atlases, consulting subspecialist experts, or conducting ancillary tests.An automatic image search can complement these options to expedite image reference search.With scarce pathological expertise, a tool that could provide urgently needed information for rapid diagnosis before conducting tests to raise a definitive one would be significant [6].CBIR provides interpretability because it presents multiple candidate images, which is beneficial when distinguishing between categories based on histopathological images alone, which is challenging, such as when information on dental infections or radiographic findings is needed.With CBIR, the retrieved results are to be evaluated by pathologists, reducing the risk of misdiagnosis owing to inaccurate results, especially for categories with very similar histology.
The CBIR system consists of two aspects: image feature extraction and nearest-neighbor search.While nearest-neighbor search is implemented in the last step of image retrieval to locate the data points in high-dimensional space that is closest to the query point, feature extraction is implemented first and it is crucial because it must adequately capture complex histological features such as staining patterns, tissue structures, and cellular morphology to create histologically relevant image representation [2,7,8].The extracted features must be robust to irrelevant color variations, such as different H&E stain brands, glass slide color degradation, and image-capturing devices ranging from whole-slide image (WSI) scanners to smartphone cameras [5,8,9].At the early stage of CBIR development, traditional image features such as shape, color, texture, or a combination were used.Recent developments showed that deep learning models outperformed traditional features [6,7].Several deep learning methods, such as supervised learning where models are pre-trained on general images or fine-tuned on histopathological images, have been used to train feature extractors [9][10][11][12], and self-supervised learning (SSL), which allows learning from unlabeled images [13][14][15].However, no studies have reviewed which method is most suitable for CBIR in oral tumors.
This study aimed to investigate the performance of different deep learning models for oral tumor CBIRs by developing a large dataset of WSIs from 541 cases with 51 tumor types and evaluating the retrieval accuracy by comparing different representational learning techniques.
This article was previously posted to the medRxiv preprint server on May 31, 2024.

Dataset, database, and test queries
We collected diagnostic slides of the oral tumor categories described in chapters 7 and 8 of the WHO Classification of Head and Neck Tumours, 4th Edition [16] from Medical and Dental University (TMDU) Hospital  and scanned them to obtain the WSIs.Representative tumor regions were annotated, and image patches were randomly extracted.The dataset consists of 49,243 image patches from 51 categories, covering approximately 50% of the oral tumor categories (Table A1 in the Appendices).
A database from a subset of the dataset was compiled to develop an automated oral tumor image reference search (Figure 1).Image representations from each model's encoder were stored in the database (Figure 2A).It contains 33,356 image patches from 30 oral tumor categories (Table A1 in the Appendices).

FIGURE 1: CBIR's place in the pathological diagnosis workflow.
CBIR optionally provides an interpretable automatic reference search that is fast and easily followed up with a more thorough study with atlases, discussions, or ancillary tests.CBIR could help point out similar features from the previously diagnosed image in the database that may lead to testable differential diagnoses more swiftly than directly consulting atlases or senior experts, which may cause the patient delayed treatment.CBIR: content-based image retrieval; H&E: hematoxylin and eosin.
Image created with BioRender.
We prepared three query sets to test the performance.Query case set A was collected from TMDU Hospital.Histopathologic slides were scanned to create WSIs for in-domain queries.Three selected tumor areas that are typical of the tumor type from the same slides were photographed with smartphone cameras to create out-of-domain-phonecam queries.Query case set B (out-of-domain B) was compiled from the University of Tokyo Hospital and query case set C (out-of-domain C) was collected from Teikyo University Hospital.The number of images analyzed in each category is detailed in Tables A3-A5 in the Appendices.
The representation of each query image was calculated for each tested model.A nearest-neighbor search was performed based on cosine similarity with the database images.Examples of query images for each category in each set can be found in Figures A1-A3 in the Appendices.The details of the methods of dataset image collection, database construction, and test queries are available in Appendix Method 1.The database construction methods, including tumor area selection, patch extraction, feature extraction code, and image retrieval, were adapted from our previous study [17].

Evaluation metrics and statistical analysis
The area under the receiver operating characteristic curve (AUC) for all query images with top-k retrieved images (k ranges from 1 to the total number of cases in the database) being the cut-points were averaged into Mean-AUC.Based on the top 10 images most similar to the query, three additional metrics were evaluated.Mean-Acc denotes the mean of the top 10 diagnostic accuracies (Acc) for each query.%query denotes the percentage of results that contained at least one accurate diagnosis category.The histological similarity in the retrieved results beyond diagnostic accuracy was evaluated by noting the retrieved images that did not belong to the accurate diagnosis category or any of its differential diagnoses [16,18,19] (Table A2 in the Appendices).The values are expressed as histologic inaccuracy (HI) and were averaged to determine the mean HI.Image retrieval and all statistical analyses were conducted using Python 3.7.12and R 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria).

Results
The highest Mean-AUC for in-domain queries was achieved by ResNet18+SimCLR (0.900), followed by ResNet18+TiCo (0.897).They achieved this at the query category level for eight out of the 11 categories.The generalizability of these models was validated using out-of-domain-phonecam queries.The highest performance for out-of-domain-phonecam queries was also achieved by ResNet18+SimCLR (0.871), followed by ResNet18+TiCo (0.857).The highest performance on the query case category levels was achieved by both SSL models for seven out of the 11 categories.We tested the performance on queries from other institutions to further demonstrate the generalizability.A similar result was yielded by both SSL models, where ResNet18+SimCLR leads with the highest Mean-AUC (0.886 for out-of-domain-B and 0.913 for out-ofdomain-D queries), followed by ResNet18+TiCo (0.881 for out-of-domain-B and 0.905 for out-of-domain-D queries).Phikon, pre-trained on histopathological images, and DINOv2, pre-trained on large-scale general images, performed comparably well with DINOv2 leading the overall Mean-AUC in out-of-domainphonecam query, with Phikon leading in the other three query sets (Table 1).A3 in the Appendices).The highest overall Mean-Acc was consistently achieved by the SSL models at different magnification levels, except for the high-magnification in-domain queries (Figures 3A, 3B).The highest %query was obtained with SSL models for most query categories (Table A4 in the Appendices).
The accuracy calculation excluded the histologic similarity between the query and retrieved images, which provides additional information about the histologic features during diagnosis.To verify whether the SSL models retrieved histologically similar images despite the low Mean-Acc, Mean-HI was introduced.Mean-HI excludes accurate diagnosis and differential diagnosis categories, which are similar to the query and include other inaccurate categories.The lowest overall inaccuracy was consistently achieved by the SSL models, except for the high-magnification in-domain queries, indicating that these models are best at retrieving the most histologically similar images beyond accurate diagnosis (Figures 3C, 3D and Table A5 in the Appendices).The top 10 retrieved images of representative cases found by all the tested models are shown in Figure 4 and Figure A5 in the Appendices.

Discussion
The diagnosis of oral pathology has long depended on histopathological image observation, which can be a burden for pathologists, especially when dealing with rare cases.In the last decade, various machine learning methods, such as supervised learning methods for classification, detection, and segmentation, have been proposed to aid in clinical and histopathological diagnosis and to improve speed and accuracy to avoid delays in diagnosis [3,4,[9][10][11][12]26].However, the exploration for oral histopathology diagnosis has been hampered by the difficulty of obtaining an adequately extensive database that includes rare cases and constructing an effective model.To our knowledge, this is the first study to construct a large database of 30 oral tumor categories, with an additional 21 categories used as the model-training dataset.
Many have argued that CBIR has greater advantages in this field.Pathologists can review CBIR results to make a final decision.However, decision bias may occur when the algorithm is unreliable.To find the best way to represent images for CBIR, we compared different methods of training the feature encoder.We then ranked the similarity of all images in the database to the test queries.The gradual concept of similarity and the multiple-ranked results of CBIR pose a challenge in interpretation.Previous research on CBIR for histopathology images uses the number of accurate image categories retrieved or the majority of categories retrieved at the top-k [10,27].Here, four evaluation measures were used to better capture the CBIR performance on different aspects: Mean-AUC from the whole database similarity rank; Mean-Acc, %query, and Mean-HI from the top 10 most similar images.Mean-AUC assumes all rank cut-points are relevant to model performance in extracting histologic features, while the top 10 similar results are relevant during image observation by pathologists in future CBIR implementation.Our findings suggest that model training for feature extraction using an in-house dataset with SSL methods outperforms other popular methods in retrieving images with an accurate diagnosis and similar histology: in-domain queries in 73% of categories and out-of-domain-phonecam queries in 64% of categories (Table 1).This was supported by their Mean-Acc, which was superior in 63% of in-domain query categories and 82% of out-of-domain-phonecam query categories (Table A3 in the Appendices).There was no significant difference in Mean-Acc between the SimCLR and TiCo models for both query categories.The Mean-Acc superiority of the SSL models was consistent at the in-domain low and moderate magnification levels (Figure 3A) and at out-of-domainphonecam at all magnification levels (Figure 3B).Additionally, both SSL models retrieved fewer images without histologic similarity at all magnification levels for both query categories (Figures 3C, 3D), meaning when low accuracy is achieved in the result, the users get several options that are histologically similar upon CBIR implementation with the SSL models because they belong to the textbook differential diagnosis categories.From this result, users could proceed with the additional tests more easily than having to do the preliminary reference search manually.Our dataset had overlapping image patches with similar histologic features.In this situation, the SSL method was superior because it compensated for the lack of a labeled dataset for learning representations that cluster the data during training based on semantic classes in conjunction with convolutional neural networks as feature extractors, regardless of the category [15].
The most impressive performance was shown for Warthin's tumor query, with Mean-AUC values greater than 0.960 in every query set using the SSL models (Table 1).One possible reason is that Warthin's tumors consist of varying proportions of papillary cystic structures lined by two layers of oncocytic epithelial cells and a lymphoid stroma with germinal centers.It is one of the most common tumors of the salivary gland, especially the parotid gland, and is generally easy to diagnose microscopically owing to its characteristic pattern [18].
The SLL models were successful for most of the test query categories.Out of those categories, the Mean-Acc of the SLL model for the ameloblastoma query was lower (Table A1 in the Appendices).Although ameloblastoma is one of the most common odontogenic tumors, it has diverse histologic variants: follicular, plexiform, acanthomatous, granular, basaloid, desmoplastic, or a mixture of these [16].This diversity requires an adequate representation of each subtype in the database for greater accuracy.However, the %query indicated that the models retrieved a similar ameloblastoma type in the top 10 for more than 93% of the queries tested (Table A4 in the Appendices).With several differential diagnoses of ameloblastoma included in the database, the best Mean-HI obtained for the in-domain ameloblastoma query was 5.98 by SimCLR (Table A5 in the Appendices).These categories can be considered histologically similar only if the characteristics of certain subtypes are captured.For example, islands of odontogenic epithelium with ameloblastic features in the follicular type may resemble ameloblastic fibroma [18].Updating the database with newly encountered subtypes continuously would improve the accuracy of rare tumor subtypes.
Although CBIR works by retrieving similar images that can be considered a digital second opinion, the result may contain images from different categories, with many having similar or indistinguishable histology.Arguably, the range of Mean-Acc values obtained with the SSL models, i.e., 1.00 to 8.94 (Table 1 and Table A3 in the Appendices), is considerably wide.However, 55% to 100% of the total queries retrieved at least one of 10 images from the correct category (Table A4 in the Appendices), and 6.46 to 1.12 out of the 10 images had no histologic similarity to the query image (Table A5 in the Appendices).This implies that displaying the complete top 10 results, including the correct diagnosis and differential diagnosis, as shown in Figure 4 and Figure A5 in the Appendices, could be significant for pathologists to narrow the differential diagnoses and conduct further research efficiently.To further improve usability, it is necessary to include clinical and other findings, such as the location of the tumor, patient history, and diagnostic criteria, which are usually essential to making a diagnosis by pathologists, when developing a CBIR system, especially in the oral region where tissue types are diverse.
This study implements patch-based CBIR.Some CBIR systems can analyze WSIs, of which implementation is prospective in developed countries.As expensive WSI scanners are not universally installed in oral laboratories, the image-capturing equipment accessible to pathologists differs considerably across regions.Microscope images captured directly using a smartphone camera could be the easiest mode for education, image sharing, and case consultation [28,29].By using patch-based CBIR where pathologists only need to select the tumor areas and capture them with smartphone cameras to create input, this technology is more accessible globally.Variations in image color and resolution resulting from these differences hinder obtaining reliable results.We tested the robustness of each model to domain shifts by testing the models on out-of-domain queries using WSIs from multiple institutions captured by different scanners and smartphone cameras.SSL models performed best for most query categories, with SimCLR or TiCo achieving the best Mean-AUC for over 68% of out-of-domain query categories, from 0.839 to 0.999 (Table 1), confirming the previous finding that SSL is more robust to domain shifts than supervised learning in some datasets, including pathological images [30].Interestingly, the performances of the vision transformer models (pretrained DINOv2 and Phikon) always come second best to the SSL models in Mean-AUC and are comparable in Mean-Acc to that of the SSL models on out-of-domain query sets (Table 1 and Table A3 in the Appendices).Although further investigation is needed, this result may be considered when choosing methods for a CBIR system.If the system is designed for an in-house database and query or a scenario applicable in large hospitals, SSL models trained on in-house cases are the optimal choice.However, where the system is designed to handle out-of-domain queries, using a pre-trained vision transformer model becomes a viable alternative to eliminate the need to train the SSL model, which could be computationally expensive.
Limitations of this study include that the experiments involved a single query to retrieve similar images.An algorithm that supports more information in the query, such as multiple query algorithms and filters for location or other diagnostic criteria, would improve retrieval accuracy and provide better support for diagnosis.Our study is limited to test queries from the same geographical area as the SSL model-training dataset.Collecting query cases from a more diverse area would be beneficial in future CBIR development to further challenge the generalizability of the result.The comparative methodology did not emphasize histopathology characteristics that differentiate between benign or malignant tumors, such as capsule invasion and mitotic activity in basal cell adenoma vs. basal cell adenocarcinoma but focused on how such image retrieval tools would be beneficial in reducing several differential diagnoses and recalling diagnosis criteria before following up with ancillary tests if necessary.Image retrieval is less likely to mislead decisionmakers owing to model overfit than a conventional classification method that predicts the possible tumor diagnosis.Nonetheless, a sequel of observations would still be needed when image retrieval is utilized.This study provides insights as the first step to developing a CBIR algorithm by observing retrieval accuracy with strict tumor category criteria on a relatively small database and did not investigate the impact of the result on decision-making in clinical settings.The implementation of CBIR as a well-rounded system to be incorporated into the comprehensive diagnostic process is beyond the scope of this study and observation of the interaction between pathologists and a CBIR system for common and rare diagnoses is needed before the system is used in clinical settings.

Conclusions
This study highlighted various methods to develop an effective CBIR model and proposed four key measures to determine the best approach for future clinical usage.These measures capture different aspects of CBIR performance that would be relevant for diagnosis decisions in clinical settings.We have shown that using SSL methods trained on an oral tumor dataset is an effective way to develop a CBIR system for the histopathological diagnosis of oral tumors compared to other commonly used methods.Vision transformer models trained on a large image dataset, though slightly less effective than SSL models, still provide strong performance and could be a viable alternative for out-of-domain queries.These approaches have considerable potential to create a clinically useful image retrieval system that accelerates the diagnostic process and improves accuracy.

Dataset
We collected diagnostic slides of the oral tumor categories described in chapters 7 and 8 of the WHO Classification of Head and Neck Tumours, 4th Edition [16].Patients were diagnosed in 2001-2022 and underwent surgery at Tokyo Medical and Dental University (TMDU) Hospital.Patients or their surrogates had the option to withdraw from this study through public notices according to the approved protocols.This study was approved by the Institutional Review Board (IRB) of TMDU (No. D2019-087).Some slides that were lost, broken, or required diagnostic confirmation with immunohistochemistry (IHC) staining were remade from the paraffin-embedded tissue blocks.Additional IHC staining was done for the secretory carcinoma and the atypical acinic cell carcinoma cases older than 2017.Categories with fewer than five cases were excluded.All slides that fulfill the inclusion criteria were scanned using a NanoZoomer S210 slide scanner (C13239-01; Hamamatsu Photonics, Shizuoka, Japan) at 40× magnification.The tumor areas were annotated by a pathology resident and verified by board-certified pathologists.We included the tumor areas that are typical to the tumor category while excluding the normal tissue and the severe artifacts such as torn or folded tissue.Image patches were then randomly extracted from the annotated tumor areas with three different sizes: 905 µm, 453 µm, and 226 µm.Twenty image patches were extracted with each magnification.The use of three different sizes was to accommodate different magnification levels and preserve the histologic information at the tissue and cellular level as much as possible.The dataset comprises 49,243 image patches from 51 categories, covering approximately 50% of the oral tumor categories (Table A1).

Database Construction
A database from a subset of the dataset containing at least 10 cases was compiled.Image representations from each model's encoder were stored in the database (Figure 2A).It contains 33,356 image patches from 30 oral tumor categories (Table A1).

Test Queries
The test queries were the cases available in the hospital repository after the collection of the database case was finished and were representatives of the tumor major categories in the database: odontogenic cysts, odontogenic tumors, benign and malignant tumors of the salivary gland, maxillofacial bone tumor, and soft tissue tumor.Slides that include severe artifacts that are impossible to avoid when extracting image patches were excluded.We prepared three query sets from different hospitals to test the performance: query case set-A was collected between 2022 and 2023 from the repository of the same institution (TMDU) as the database.Only the tumor categories that have two representative cases were included as the test queries.For the WSI queries, the same method was used to create image patches from the WSI as from the database image.All scanned WSIs were annotated to three to four representative tumor areas per WSI.From these areas, 20 image patches per magnification level (905 µm, 453 µm, and 226 µm) per slide were extracted to create image patches.The total number of image patches used for the evaluation was 2,520 images from the WSI and 594 smartphone images.The number of images analyzed in each category is detailed in Tables A3-A5.
The representation of each query image was calculated with each tested model.The nearest-neighbor search was performed based on cosine similarity with the database images.Examples of query images for each category in each set can be found in Figures A1-A3.The detailed methods for database construction, including the tumor areas selection, patch extraction, feature extraction code, and image retrieval were adapted from our previous study [17].

ImageNet-1k Pre-trained CNN
We tested VGG16 pre-trained on 1.2 million images from ImageNet after it was shown to perform well as an image classifier in several studies [21,22].In this study, the block4_conv3 layer was used as the feature extractor because the middle layer of a convolutional neural network (CNN) architecture has been shown to capture features that are more suitable for histopathology images [11].

ImageNet-22k Pre-trained Vision Transformer
We used the DINOv2 ViT-L/14 model pre-trained with the SSL method on general images from ImageNet-22k [25].The last layer was used for feature extraction.The images were cropped to 252 pixels owing to the input size restrictions.

Fine-Tuned CNN
We fine-tuned all the ImageNet-1k pre-trained ResNet18 models on our dataset using the supervised learning method to classify 51 categories.All the layers were trained with a learning rate of 0.001, a batch size of 32, and 100 epochs in PyTorch 1.11.0 (Meta AI, Astor Place, New York).During training, a random 90degree rotation, random horizontal and vertical flips, color jitter, Gaussian blur, and color normalization transformation were performed.The training-to-test ratio was 8:2.

Histopathology Image-Trained Vision Transformer
A vision transformer-based model, Phikon, was trained with 40 million pan-cancer tiles extracted from The Cancer Genomic Atlas (TCGA) using the masked image modeling (MIM) method as an SSL framework.MIM learns meaningful representation by randomly masks portions of an image and trying to reconstruct those masked portions.This model was developed by Owkin Inc. (Paris, France) [23].The total number of image patches extracted from whole-slide images is 1,320 images.The total number of image patches extracted from the whole-slide images is 660 images.The performances of the SSL models are superior for most categories, further validating the robustness of the models under a wide range of histopathological image conditions.The highest Mean-Acc for each category is marked in italics.
SSL: self-supervised learning; Acc: accuracies.The SSL models consistently best retrieved the histologic similarity of the results.It is noteworthy that a lower Mean-HI is preferable.The lowest Mean-HI for each category is marked in italics.

FIGURE 2 :
FIGURE 2: CBIR workflow and SSL models training process.(A) Schematic representation of CBIR using deep neural networks to retrieve similar oral tumor histopathological images.The similarity is determined by a nearest-neighbor search, which calculates the cosine similarity of the query image's image representation to all database image representations in the embedding space.(B) The training process of the SSL models used ResNet18 as an encoder.The image representations were passed to a projector and subjected to feature normalization.In the SimCLR method, the training loss function yields a low value when the representation of the original image (X′) and its augmentation (X′′) are close together, while it yields a larger value when X′ and a different image augmentation (Y′′) are far apart.In the Transformation Invariance and Covariance Contrast (TiCo) method, the process still pulls X′ and X′′ close, and then the redundancy in the representation is removed using covariance contrast without using a different image (Y′′).CBIR: content-based image retrieval; SSL: self-supervised learning.Image created with BioRender.

FIGURE 3 :
FIGURE 3: Mean-Acc and Mean-HI comparisons.(A) Mean-Acc comparison for the in-domain query of all models by magnification showing the highest performance of Phikon at the highest magnification and that of ResNet18+SimCLR and ResNet18+TiCo at the moderate and lowest magnification.(B) Mean-Acc comparison for out-of-domain-phonecam queries of all models by magnification shows the highest performance of ResNet18+SimCLR and ResNet18+TiCo at the highest and lowest magnification.Both model performances were comparable to that of pre-trained DINOv2 at moderate magnification.(C) Mean-HI comparison for in-domain queries by magnification shows that ResNet18+SimCLR and ResNet18+TiCo outperformed other models except for the highest magnification where Phikon leads with a wider interquartile range.(D) Mean-HI comparison for out-of-domain-phonecam query by magnification showing ResNet18+SimCLR and ResNet18+TiCo outperformed other models.(C-D) Please note that a lower Mean-HI value denotes a higher model performance.Acc: accuracies; HI: histologic inaccuracy.

FIGURE 4 :
FIGURE 4: Comparison of the top 10 results of all models for out-ofdomain-phonecam queries from different categories.ResNet18+SimCLR and ResNet18+TiCo are consistent with the result that provides the highest Acc for different query categories.More examples are presented in Figure A5 in the Appendices.Green outline: accurate diagnosis category; yellow outline: differential diagnosis categories; red outline: inaccurate diagnosis category.

Finally, 11
tumor categories with two cases in each were used as the test queries.Histopathologic slides were scanned to create WSIs for in-domain queries with the same device as the database image.Three selected tumor areas that are typical of the tumor type from the same slides were photographed with smartphone cameras (Samsung S21FE: 12 megapixels; iPhone 6: 8 megapixels; and Motorola g8: 16 megapixels) using an Olympus BX53 microscope with 10×, 20×, and 40× objective lens magnification.All smartphone images were cropped to square, removing the outer rim of the microscope lens, to create out-ofdomain-phonecam queries.The location (indoor laboratory environment) and the amount of light from the microscope were unchanged when taking the smartphone images.The query case set B was compiled from the University of Tokyo case (approved by the IRB of The University of Tokyo; No. 2019158NI).Eleven cases from eight salivary gland tumor categories were included in this study.Only benign and malignant tumors of the salivary gland could be collected for out-of-domain-B since no other major categories exist in the repository.Histopathologic slides were scanned using a NanoZoomer 2.0HT slide scanner (C9600-12, Hamamatsu Photonics) to create WSIs for out-of-domain-B queries.The query case set C was collected from Teikyo University Hospital from 2018 to 2023 (approved by the IRB of Teikyo University; No. 23-054).Only the tumor categories that have two representative cases were included as the test queries.Twenty cases from eight oral tumor categories were included in the study.The histopathologic slides were scanned using a NanoZoomer XR slide scanner (C12000-02; Hamamatsu Photonics) to create WSIs for out-of-domain-C queries.Patients for query cases set-B and set-C or their surrogates had the option to withdraw from this study through public disclosure according to the approved protocols.
[24]) were investigated.ResNet18 is used as the backbone.During training, random color jittering, grayscale, image scaling, horizontal and vertical flipping, 90-degree rotation, Gaussian blurring, and color augmentation were implemented.Both models were trained with Lightly version 1.3.3, with a learning rate of 1.2, a batch size of 32 × 32 (with accumulated gradients), and 1,000 epochs.Examples of original and augmented images are shown in FigureA4.The ResNet18 backbone model trained on 57 histopathology image datasets (38,594 image patches and 24,923 WSIs) developed byCiga et al. (

FIGURE 5 :
FIGURE 5: Examples of set-A query images from each category for each magnification level.

FIGURE 6 :
FIGURE 6: Examples of set-B query images from each category for each magnification level.

FIGURE 7 :
FIGURE 7: Examples of set-C query images from each category for each magnification level.The total number of image patches extracted from the whole-slide images is 1,200 images.

FIGURE 9 :
FIGURE 9: Further examples of comparisons of the top 10 results of all models for out-of-domain-phonecam queries from different categories.ResNet18+SimCLR and ResNet18+TiCo are consistent with the result providing the highest Acc for different query categories.The categories of retrieved images belonging to the differential diagnoses show which retrieved images have histologic similarity to the query.This comparison demonstrates the SSL model's capability to retrieve histologically similar images when the exact accurate diagnosis is not retrieved.The pre-trained DINOv2 and Phikon also show such potential albeit less consistently across the query category than the self-supervised learning (SSL) models.Green outline: accurate diagnosis category; yellow outline: differential diagnosis categories; red outline: inaccurate diagnosis categories.

TABLE 1 :
Mean-AUC (SD) of each test query category in in-and out-of-domain image queries.The performances of the SSL models are superior for most test query categories and the overall averages.The highest AUC for each category is marked in italics.2024Herdiantoputriet al.Cureus 16(6): e62264.DOI 10.7759/cureus.622646 of 28

TABLE 2 : The tumor categories included in the dataset, their corresponding ICD-O codes, and the total number of cases.
total of 51 categories are compiled as the image dataset.Some categories do not correspond to the ICD-O but are described in the WHO Classification of Head and Neck Tumours, 4th Edition.Thirty categories consisting of 10 to 20 cases, marked in italics, were included in the CBIR database.
A ICD-O: International Classification of Diseases for Oncology; CBIR: content-based image retrieval; NOS: not otherwise specified; MALT: mucosaassociated lymphoid tissue.

TABLE 3 : List of test query categories and all their respective differential diagnoses that are represented in the database for Mean-HI evaluation.
NOS: not otherwise specified; HI: histologic inaccuracy.

TABLE 5 : Percentage of queries that retrieved at least one correct diagnosis among the top 10 results for each test query category (%query) in in-and out-of-domain image queries.
The performance of the SSL models was superior in most categories.The highest %query for each category is marked in italics (%).