Assessment of Multiple Atlas-Based Segmentation in Prostate Bed Contouring

Dice similarity coefficients (DSC) of single best matching (SBM) and multiple best matching (MBM) prostate bed automated atlas-based segmentation (AABS) contours were compared to an expert panel gold standard. DSC scores improved with MBM in bladder (from 0.73 to 0.82) and penile bulb (from 0.40 to 0.54), with no statistically significant improvement in other organs.
Categories: Radiation Oncology, Urology, Oncology


Introduction
Advancements in radiotherapy, such as intensity-modulated radiotherapy (IMRT) and image-guided radiotherapy (IGRT), improve therapeutic outcome by facilitating dose escalation in target tissue while sparing adjacent normal tissues [1]. These advancements also demand high contouring accuracy from radiation oncologists in order to optimize patient outcomes. Yet, manual contouring can be tedious and time-consuming. Furthermore, inter- and intra-observer variability is a major source of uncertainty [2]. To reduce contouring time and decrease uncertainty, automated contouring techniques are increasingly being explored in the medical literature.
Computer-assisted auto-contouring algorithms, such as automated atlas-based segmentation (AABS), are a promising approach to overcoming the limitations of manual contouring. AABS begins by automatically selecting, from a database of pre-contoured CTs, a best match to the patient simulation CT. It then performs a deformable registration between the two CTs so that the selected contour better matches the patient anatomy. AABS algorithms can use a single best match (SBM), where only one pre-contoured CT is taken from the database, or multiple best matches (MBM), where a number of best-matching contours are retrieved and combined algorithmically to generate the final contour.
The performance of AABS is a focus of research in the field of radiation oncology. Previous studies have demonstrated that AABS decreases inter- and intra-observer variability as well as contouring time in multiple cancer types [1,3-5]. Yet, some research suggests the need for further improvement of AABS approaches: Hwee, et al. found that only 12% of their auto-contoured images were considered clinically acceptable by blinded human observers [1]. In particular, previous studies have not characterized differences between single best matching (SBM) and multiple best matching (MBM) approaches, or differences between contours when the number of best matches is varied in MBM. Thus, the objective of this study is to investigate the potential for improvement of automated contours using commercially available contouring software that features AABS with MBM capabilities.

Materials And Methods
Five pelvic CT simulation datasets (512 x 512 pixels, 3 mm slice thickness, 120 kVp) from five different prostate bed patients were each contoured by an expert panel of five radiation oncologists [1]. The six structures delineated were prostate bed, rectum, bladder, penile bulb, and left and right femoral heads. A consensus contour for each structure was generated using the simultaneous truth and performance level estimation (STAPLE) algorithm [6]. The STAPLE algorithm takes a collection of observer contours as input and estimates the true volume of the structure. The STAPLE consensus contours were taken as the gold standard against which the investigational (automated) contours were compared.
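To illustrate how a consensus can be estimated from several observer contours, the following is a minimal sketch of a simplified binary STAPLE-style expectation-maximization loop. It is not the implementation used in this study: the function name `staple_binary`, the single global foreground prior, the initial sensitivity and specificity of 0.9, and the 0.5 threshold are all assumptions made for illustration, and contours are represented as flattened 0/1 voxel arrays.

```python
import numpy as np

def staple_binary(decisions, max_iter=100, tol=1e-6, eps=1e-6):
    """Simplified binary STAPLE-style EM (illustrative sketch only).

    decisions: array-like of shape (n_raters, n_voxels) with 0/1 entries.
    Returns (w, p, q): per-voxel probability of belonging to the structure,
    per-rater sensitivity, and per-rater specificity.
    """
    d = np.asarray(decisions, dtype=float)
    # Assumption: one global prior, taken as the mean labelled fraction.
    prior = np.clip(d.mean(), eps, 1 - eps)
    p = np.full(d.shape[0], 0.9)  # assumed initial sensitivities
    q = np.full(d.shape[0], 0.9)  # assumed initial specificities
    w = d.mean(axis=0)
    for _ in range(max_iter):
        # E-step: likelihood of each rater's decisions if the voxel is
        # truly inside (a) or truly outside (b) the structure.
        a = prior * np.prod(np.where(d == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(d == 1, 1 - q[:, None], q[:, None]), axis=0)
        w_new = a / (a + b)
        # M-step: re-estimate each rater's performance, clipped away from
        # 0 and 1 to keep the products above strictly positive.
        p = np.clip((d * w_new).sum(axis=1) / w_new.sum(), eps, 1 - eps)
        q = np.clip(((1 - d) * (1 - w_new)).sum(axis=1) / (1 - w_new).sum(), eps, 1 - eps)
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w, p, q

# Toy example: five hypothetical observers contouring a 10-voxel strip.
truth = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
over = truth.copy()
over[2] = 1   # one observer includes an extra voxel
under = truth.copy()
under[6] = 0  # one observer misses a voxel
w, p, q = staple_binary([truth, truth, truth, over, under])
consensus = (w >= 0.5).astype(int)
print(consensus, p.round(2), q.round(2))
```

In this toy case the observers who over- or under-contour receive a lower estimated specificity or sensitivity, so their disagreements carry less weight in the consensus than they would under a simple vote.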
A previously developed atlas database [1] was used for AABS auto-contouring. Commercially available software (MIM Software Inc., Cleveland, OH, USA) was used to perform AABS, since it features not only SBM but also a 'multi-atlas' tool that allows MBM. In the case of MBM, the software generates the final segmentation from the volume of overlap between at least half of the indexed contours (for example, two of three, two of four, three of five, etc.) (Figure 1). In this study, MBM of up to 10 best matches was explored. For each of the five patients and for each of the six structures, 10 AABS contours were generated by varying the number of best matches from one to 10. Thus, a total of 300 AABS contours (10 AABS x six structures x five patient datasets) were generated and compared against the six STAPLE consensus contours of the corresponding patient to generate study datapoints. StructSure software (StructSure™, Standard Imaging Inc., Middletown, WI, USA) was used to calculate the Dice similarity coefficient (DSC) [6]. The DSC is defined as:
$$\mathrm{DSC} = \frac{2\,(V \cap V_c)}{V + V_c}$$
where $V$ is the volume within a contour given by a single observer, $V_c$ is the volume within the consensus contour, and $V \cap V_c$ denotes the volume of overlap between the two contours [7]. Since the DSC is bounded between 0 and 1, the results were logit-transformed prior to statistical analysis to better approximate normality [8]. ANOVA testing was used to assess the statistical significance of any association between the number of best matches and logit(DSC).

Results
Table 1 summarizes the mean and standard deviation of the DSC scores, averaged over the five patients, for each structure and each number of best matches from one to 10. Bladder and penile bulb showed a statistically significant improvement in DSC score, which rose gradually as the number of best matches increased: from a DSC of 0.73 with one best match to 0.82 with 10 best matches for bladder (p < 0.001), and from a DSC of 0.40 with one best match to 0.54 with 10 best matches for penile bulb (p = 0.047).
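The overlap rule, the DSC, and the logit transform described in the Methods can be sketched as follows. This is a minimal illustration, not the commercial software's implementation: the function names `majority_overlap`, `dice`, and `logit` are hypothetical, contours are assumed to be binary voxel masks on a common grid with equal voxel size (so voxel counts stand in for volumes), and the ANOVA step is not reproduced.

```python
import math
import numpy as np

def majority_overlap(masks):
    """Keep voxels covered by at least half of the input masks
    (two of three, two of four, three of five, and so on)."""
    stack = np.asarray(masks, dtype=bool)
    votes = stack.sum(axis=0)
    return 2 * votes >= stack.shape[0]

def dice(mask, consensus):
    """Dice similarity coefficient: 2 * overlap / (V + Vc)."""
    a = np.asarray(mask, dtype=bool)
    b = np.asarray(consensus, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        raise ValueError("both masks are empty; DSC is undefined")
    return 2.0 * np.logical_and(a, b).sum() / total

def logit(dsc):
    """Logit transform; only defined for 0 < dsc < 1."""
    return math.log(dsc / (1.0 - dsc))

# Toy 1-D example with three hypothetical atlas contours.
atlas_masks = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
]
gold = [0, 1, 1, 1, 1, 0]

fused = majority_overlap(atlas_masks)
dsc_mbm = dice(fused, gold)
dsc_sbm = dice(atlas_masks[2], gold)
print(dsc_mbm, logit(dsc_mbm))
print(dsc_sbm, logit(dsc_sbm))
```

With a single input mask the overlap rule returns that mask unchanged, which corresponds to the SBM case. Counting voxels instead of computing physical volumes is sufficient here because the voxel volume cancels in the DSC ratio under the equal-voxel-size assumption.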
The rectum also showed an improvement from 0.56 to 0.67, but this finding was not statistically significant (p = 0.509). The remaining structures did not show improvement as the number of best matches increased.

Discussion
Two factors that may contribute to failure of AABS are variability of anatomy between patients and poor contrast between structures and their background on CT. AABS relies on similarity between patient anatomies, so that a pre-contoured CT can be used to closely approximate the current structure for automated contouring. Large atlas databases are used to increase the chance that a best match is similar to the current patient's anatomy. Deformable registration is also performed to improve the match between CT scans. Yet, structures with poor contrast pose problems for deformable registration algorithms, which rely on CT contrast differences. Both of these factors can be structure-specific. Thus, AABS is expected to have variable success depending on the features of the structure involved. High-contrast, consistently shaped structures are likely to be well-suited to AABS techniques, whereas low-contrast structures with variable anatomy are more likely to be poorly suited.
The results of this study show a benefit of using MBM for contouring certain structures. In particular, the bladder and penile bulb contours demonstrated a marked improvement as the number of best matches increased. The MBM approach was able to improve the DSC of penile bulb but not of prostate bed. The femoral heads had the highest DSC scores, which were achieved even with SBM. The high DSC scores of the femoral heads can be attributed to their relatively consistent shape between patients, as well as their very high degree of contrast. The bladder also has a high-contrast border, which may explain its relatively high DSC scores. Yet, the bladder can have a somewhat variable shape, which may explain why the bladder DSC improves gradually as the number of best matches is increased. In contrast, the prostate bed is highly variable from patient to patient and is also a relatively low-contrast target with ill-defined borders. This may have contributed to the poor DSC scores of prostate bed in this study as well as in previous AABS studies.
There are several limitations to the study. First, only one contouring algorithm was investigated. The overlap algorithm used to combine the MBM contours into a single contour is one of many possible MBM approaches. The results described in this manuscript are therefore not necessarily generalizable to other AABS software solutions that use different contouring algorithms. Other new algorithms for multi-atlas contouring are emerging and show promising improvements in accuracy [9][10][11]. Furthermore, the number of atlases in the database was fixed but large. It is not clear whether the contours could be further improved by increasing the number of pre-contoured CT datasets in the AABS library. Another limitation of this study is the uncertainty of the gold standard. As with all studies investigating contouring variability, the definition of a true gold standard is challenging. In this study, the gold standard was generated using the STAPLE algorithm. This approach may reduce the subjectivity of gold standard contour definition. Yet, it is still unclear how gold standard contours are best generated in these studies.

Conclusions
Future work includes the identification of other structures that could benefit from an MBM approach. Pirozzi, et al. recently found that a multi-atlas approach for lung cancer resulted in significantly more accurate contours than a single best-matched index [10]. The organs in that study included esophagus, spinal cord, heart, left lung, right lung, and trachea. However, that study did not specify whether the improvement was seen in all of these structures or just a select few. Future work will also investigate the effect of increasing the size and/or contents of the AABS library on observed Dice coefficients between automated contours and clinical gold standards.