Impact of Contouring Variability on Dose-Volume Metrics used in Treatment Plan Optimization of Prostate IMRT

Background and Purpose: Contouring variability remains a major source of uncertainty in radiotherapy treatment planning. The objective of this study was to identify the effect of contouring variability on dose-volume histogram (DVH) metrics used for treatment plan optimization of prostate IMRT. Methods: A total of 25 observers were recruited to delineate the bladder, prostate, and rectum in a CT scan of a low-risk prostate cancer case. Dice similarity coefficients (DSC) were calculated between each observer contour and an algorithmically generated consensus contour. The observer contours were used to generate treatment plans and calculate the DVH for each organ. The variance between DVH curves was calculated at D95% for the prostate, and at V65, V70, and V75 Gy for the bladder and rectum. Results: DSC for the bladder, prostate, and rectum were 0.971 ± 0.007, 0.838 ± 0.067, and 0.771 ± 0.124, respectively. DVH variance for all three structures was primarily driven by differences in prostate contouring. Variations in rectal contouring had important additional impacts only on the rectal DVH. Bladder contouring variation had little impact on DVH metrics. Conclusions: Although the rectum was the most inconsistently contoured structure, its variability did not impact the DVH as much as prostate variability did. The dosimetric impact of contouring variability therefore cannot be predicted solely with DSC. Categories: Medical Physics, Radiation Oncology


Introduction
Modern radiation therapy delivers highly conformal dose distributions. As technological capability improves, accurate delineation of the planning target volume (PTV) and organs at risk (OARs) becomes increasingly important. Inter-clinician variability during contouring remains a substantial source of uncertainty. Contouring variability has therefore become a focus of research, with emphasis on both the PTV and OARs for many anatomical sites [1-4]. The impact of contour variability on dose-volume histogram (DVH) parameters used in treatment plan optimization has been less well characterized.
A DVH summarizes the dose distribution across an entire organ or target volume of interest and is sensitive to variations in contouring. Previous work has identified DVH differences due to contouring variation in breast, oropharyngeal, and prostate cancer [5][6][7], but these studies did not examine the independent contribution of contouring variability of individual structures to DVH variations. For example, it remains unclear how contouring variability of the bladder specifically affects the DVH of the prostate. The dose optimization algorithm seeks to deposit as much dose as possible in the PTV while minimizing dose deposition in the delineated OARs. The interplay of PTV and OAR contours therefore requires consideration, because treatment plan optimization is driven collectively by the competing parameters of dose to target and normal tissue toxicity.
We examined the effect of individual and combined contouring variability on OAR and PTV DVH for a low-risk prostate cancer case contoured by multiple observers. In doing so, we sought to isolate the effects of variability in contouring individual and grouped structures on the DVH for all the structures of interest and potentially identify sources where efforts to reduce variability in contouring may have the largest impact on plan optimization results as characterized by deviations in key DVH parameters.

Contouring collection and comparison of similarity
An online contouring challenge of the prostate, bladder, and rectum was completed on an anonymized abdominal-pelvic CT data set of a low-risk prostate cancer case (120 kVp, 512x512 pixels/slice, 3-mm slice thickness). DICOM-RT structure files containing 25 unique contours of the prostate, bladder, and rectum were obtained using a multi-institutional online program (www.contouringchallenge.com). In agreeing to participate in the challenge, observers provided the requested contours and consented to the use of the contours for research analysis.
To compare the contours of individual observers, a "gold standard" contour is required (Figure 1). For this study, the gold standards were consensus contours created for each of the three structures using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm [8], which estimates the true volume of a structure from a collection of observer contours.

FIGURE 1: "Gold standard" contour
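To make the consensus step more concrete, the following is a minimal sketch of a simplified binary STAPLE-style expectation-maximization loop. It is not the implementation used in this study. The function names, the flattened voxel representation, the initial sensitivity and specificity of 0.99, the global prior, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def staple_binary(decisions, n_iter=50, eps=1e-6):
    """Simplified binary STAPLE-style EM (illustrative sketch only).

    decisions[j][i] is 1 if observer j labelled voxel i as inside the
    structure, else 0. Returns (weights, sensitivity, specificity), where
    weights[i] estimates the probability that voxel i is truly inside.
    """
    n_obs, n_vox = len(decisions), len(decisions[0])
    # Assumption: the prior is the overall fraction of voxels labelled inside.
    prior = sum(map(sum, decisions)) / (n_obs * n_vox)
    prior = min(max(prior, eps), 1 - eps)
    sens = [0.99] * n_obs  # assumed starting sensitivity per observer
    spec = [0.99] * n_obs  # assumed starting specificity per observer
    weights = [0.0] * n_vox
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is truly inside.
        for i in range(n_vox):
            log_in, log_out = math.log(prior), math.log(1 - prior)
            for j in range(n_obs):
                if decisions[j][i]:
                    log_in += math.log(sens[j])
                    log_out += math.log(1 - spec[j])
                else:
                    log_in += math.log(1 - sens[j])
                    log_out += math.log(spec[j])
            diff = max(min(log_out - log_in, 700.0), -700.0)
            weights[i] = 1.0 / (1.0 + math.exp(diff))
        # M-step: re-estimate each observer's performance parameters.
        w_in = sum(weights)
        w_out = n_vox - w_in
        for j in range(n_obs):
            tp = sum(w for w, d in zip(weights, decisions[j]) if d)
            tn = sum(1 - w for w, d in zip(weights, decisions[j]) if not d)
            sens[j] = min(max(tp / max(w_in, eps), eps), 1 - eps)
            spec[j] = min(max(tn / max(w_out, eps), eps), 1 - eps)
    return weights, sens, spec


def consensus_mask(weights, threshold=0.5):
    """Threshold the voxel probabilities into a binary consensus contour."""
    return [1 if w >= threshold else 0 for w in weights]


# Toy example: two observers agree; a third mislabels two voxels.
truth = [1] * 10 + [0] * 10
noisy = list(truth)
noisy[3], noisy[12] = 0, 1
weights, sens, spec = staple_binary([truth, list(truth), noisy])
consensus = consensus_mask(weights)
```

In this toy case the two mislabelled voxels are outvoted, and the third observer receives a lower estimated sensitivity than the other two.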
The 25 observer contours (prostate, rectum, bladder) were each compared against the corresponding consensus contour with StructSure software (StructSure™, Standard Imaging Inc., Middletown, WI, USA). StructSure imports DICOM-RT structure files and performs measurements of similarity between a structure file identified as the "gold standard" and a comparison "test" structure file. For this study, we used the Dice similarity coefficient (DSC) [9] to compare observer contours against the consensus contour obtained by the STAPLE algorithm.
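For reference, the DSC of two volumes A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for contours represented as sets of voxel indices. This representation and the function name are assumptions for illustration and do not reflect how StructSure computes the metric internally.

```python
def dice_similarity(test_voxels, reference_voxels):
    """Return 2|A n B| / (|A| + |B|) for two collections of voxel indices."""
    test, ref = set(test_voxels), set(reference_voxels)
    if not test and not ref:
        return 1.0  # assumed convention: two empty structures agree fully
    return 2.0 * len(test & ref) / (len(test) + len(ref))


# Toy example: a 4x4 reference square and a test square shifted by one voxel.
reference = {(x, y, 0) for x in range(4) for y in range(4)}
shifted = {(x + 1, y, 0) for x in range(4) for y in range(4)}
dsc = dice_similarity(shifted, reference)  # 12 shared voxels -> 24 / 32
```

Because the denominator grows with structure size, the same absolute boundary error lowers the DSC of a small structure more than that of a large one.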

Creation of contour series isolating variability in a single structure
Differences in the DSC do not necessarily reflect potential differences in treatment planning dose effects. An objective of this study was to determine the individual contribution of prostate, bladder, and rectum contouring variability to the DVH deviations of these three contours superimposed on the consensus dose distribution. For this purpose, we designed four contour series, each isolating a different source of contour variation. The first contour series includes the 25 contour sets of all observers (Vary All). The second series was created by importing only the 25 observer contours for the prostate, combined with the STAPLE contours of the bladder and rectum (Vary Prostate Only). The third and fourth series were created by importing the observer contours for the bladder (Vary Bladder Only) and for the rectum (Vary Rectum Only), each combined with the STAPLE contours of the remaining two organs. These contour series were investigated with the treatment plan optimization protocol outlined below.
To avoid bias, an automated class solution was applied to all contour sets in the four contour series described above using the scripting and plug-in utilities in Pinnacle software (version 8.1y). The IMRT plan was designed to use five fields of 18 MV x-rays targeting a PTV (GTV plus a uniform margin of 10 mm, except 7 mm posteriorly) with 76 Gy to the normalization point (isocentre), and a consistent set of objectives and weights. Using this solution, an optimized reference plan containing the STAPLE gold standard contours was generated and approved by two radiation oncologists, guided by RTOG-0415 standards (Radiation Therapy Oncology Group, www.rtog.com).
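The construction of the four contour series can be sketched as a simple substitution of observer contours into the consensus set. The dictionary layout, the function name, and the placeholder strings standing in for contour data are hypothetical.

```python
ORGANS = ("prostate", "bladder", "rectum")


def build_contour_series(observer_sets, consensus):
    """Build the four series; each is a list of {organ: contour} sets.

    observer_sets: one {organ: contour} dict per observer.
    consensus: {organ: contour} dict holding the STAPLE contours.
    """
    series = {"Vary All": [dict(obs) for obs in observer_sets]}
    for organ in ORGANS:
        name = f"Vary {organ.capitalize()} Only"
        # Keep the consensus contours, swapping in one observer organ.
        series[name] = [
            {**consensus, organ: obs[organ]} for obs in observer_sets
        ]
    return series


# Toy example with two observers and string placeholders for contours.
staple = {organ: f"staple_{organ}" for organ in ORGANS}
observers = [
    {organ: f"obs{k}_{organ}" for organ in ORGANS} for k in range(2)
]
contour_series = build_contour_series(observers, staple)
```

Each series contains one contour set per observer, so 25 observers would yield 25 plans per series under this layout.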

Treatment plan optimization and DVH curves
Using the same class solution developed for the reference plan based on STAPLE consensus contours, another set of treatment plans was generated for each test contour series. The resulting dose distributions were superimposed onto the STAPLE consensus contours to determine the dose that would have resulted in the "true" structures. To be clear, the dose distributions were optimized using test contours, but the dose-volume analysis was based on the "gold standard" consensus contour set. Potential geographic misses of the "true" target or dose spillage into normal structures were thus assayed. The DVH data for these standard structures subjected to test dose exposure were exported for analysis of variance.
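As an illustration of the evaluation step, the sketch below reads DVH metrics from a dose distribution restricted to a consensus mask. It assumes equal-volume voxels, a flattened dose grid, and hypothetical function names and dose values. Clinical planning systems typically interpolate the DVH rather than counting voxels.

```python
import math


def structure_doses(dose_grid, mask):
    """Doses (Gy) of the voxels that lie inside the structure mask."""
    return [d for d, inside in zip(dose_grid, mask) if inside]


def volume_at_dose(doses, dose_level):
    """V_x: percent of the structure receiving at least dose_level Gy."""
    return 100.0 * sum(d >= dose_level for d in doses) / len(doses)


def dose_at_volume(doses, volume_percent):
    """D_x%: minimum dose (Gy) within the hottest volume_percent of voxels."""
    ranked = sorted(doses, reverse=True)
    n = max(1, math.ceil(volume_percent / 100.0 * len(ranked)))
    return ranked[n - 1]


# Toy dose grid from a plan optimized on one observer's test contours,
# evaluated on the consensus ("gold standard") masks.
dose_grid = [78.0, 77.0, 76.0, 75.0, 72.0, 68.0, 60.0, 40.0, 20.0, 5.0]
consensus_prostate = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
consensus_rectum = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

prostate_doses = structure_doses(dose_grid, consensus_prostate)
rectum_doses = structure_doses(dose_grid, consensus_rectum)
prostate_d95 = dose_at_volume(prostate_doses, 95)  # coldest of 5 voxels
rectum_v65 = volume_at_dose(rectum_doses, 65.0)    # 1 of 4 voxels >= 65 Gy
```

The masks stay fixed across observers in this scheme; only the dose grid changes with the test contours used for optimization.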

Statistical analysis
For the prostate, the variance (σ²) of the DVH curves across the 25 observers was characterized at D95%, the dose received by 95% of the volume. Thus, the % volume was held fixed and the variance in dose across the 25 observer DVH curves was measured (a horizontal line sampling the DVH curves). The curves were a priori sampled from D92.5% to D97.5% at increments of 0.2%, resulting in 25 measurements of σ², from which the mean and standard deviation of σ² were calculated for the region D95 ± 2.5%. For the rectum and bladder, the dose was held fixed and σ² of the % volume across the 25 observer DVH curves was calculated (vertical line sampling). For these organs, σ² was a priori characterized in three regions of the curve, V65 ± 2.5 Gy, V70 ± 2.5 Gy, and V75 ± 2.5 Gy, sampled at increments of 0.2 Gy.
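The vertical line sampling described above can be sketched as follows. The sampling levels are passed in explicitly, and the use of the sample variance (n - 1 denominator) is an assumption, since the text does not state which estimator was used. All function names and dose values are hypothetical.

```python
import statistics


def volume_at_dose(doses, dose_level):
    """Percent of the structure receiving at least dose_level Gy."""
    return 100.0 * sum(d >= dose_level for d in doses) / len(doses)


def variance_summary(metric, observer_doses, levels):
    """Mean and SD of the between-observer variance over sampled levels.

    metric(doses, level) returns one DVH value for one observer's plan.
    Assumption: the sample variance is used at each level.
    """
    variances = [
        statistics.variance([metric(doses, lv) for doses in observer_doses])
        for lv in levels
    ]
    return statistics.mean(variances), statistics.stdev(variances)


# Toy example: rectum voxel doses (Gy) from three observers' plans,
# sampled at two fixed dose levels.
rectum_by_observer = [
    [72.0, 68.0, 66.0, 50.0],
    [71.0, 69.0, 60.0, 50.0],
    [66.0, 64.0, 60.0, 50.0],
]
mean_var, sd_var = variance_summary(
    volume_at_dose, rectum_by_observer, levels=[65.0, 70.0]
)
```

The same helper covers horizontal sampling for the prostate if a dose-at-volume metric and a list of % volume levels are supplied instead.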
To test for differences in variance (in volume or in dose depending on which structures were varied), the paired t-test was used, comparing the following DVH pairings: All vs. Prostate Only, All vs. Bladder Only, and All vs. Rectum Only. All statistical analysis was performed using SAS software (version 9.2), using two-sided statistical testing at the 5% significance level.

Results
Figure 2 shows a representative central slice of the CT study with all observer contours superimposed. The mean and standard deviation of DSC for observer contours compared to the respective STAPLE gold standard contours were 0.971 ± 0.007 (bladder), 0.838 ± 0.067 (prostate), and 0.771 ± 0.124 (rectum).
Figures 3A-3D illustrate the superimposed DVHs of treatment plans when varying all contours (A), as well as when varying the contours of the prostate, bladder, and rectum individually (B-D). Qualitatively, there is relatively high dispersion of the DVH curves for all three structures when all contours are varied or when only the prostate contour is varied. Dispersion in the DVH curves of the rectum also occurs when the rectum contours are varied. Little to no dispersion occurs when bladder contours are varied.
Tables 1-3 quantitatively summarize the variance statistics of the relevant bladder (Table 1), rectum (Table 2), and prostate (Table 3) DVH metrics. Collectively, both the qualitative and quantitative DVH analyses demonstrate that the variance in the prostate DVH is primarily driven by differences in prostate contouring and that differences in rectum and bladder contouring have less impact on variations in prostate metrics. Observed rectal DVH variation is primarily driven by differences in rectal contouring as well as prostate contouring, whereas bladder contouring variation did not have a similar impact. For the bladder DVH variation, differences in the contouring of the prostate (and not the bladder itself) were primarily responsible.
The 95% confidence intervals mostly show statistically significant differences in DVH variance between the "vary all" series and each of the "vary bladder only", "vary rectum only", and "vary prostate only" series. Two comparisons did not show statistical significance: the bladder DVH variance and the prostate DVH variance comparisons of the "vary all" series to the "vary prostate only" series. This demonstrates that varying all contours together as a group did not significantly change the variance of the prostate or bladder DVH when compared to varying only the prostate contour.

Discussion
The DSC metric is commonly used in studies of contouring variability [10]. In this study, DSC was highest for the bladder (0.971), followed by the prostate (0.838), and then the rectum (0.771). There are several possible reasons for the high similarity among bladder contours compared to the other contours. Firstly, the bladder is the largest structure of the three, meaning that variability itself must be larger to influence DSC significantly. Furthermore, the bladder has a relatively well-defined border on CT imaging, which facilitates delineation. In comparison, the rectum is generally smaller with less well-defined borders, particularly at the prostate boundary and inferiorly towards the pelvic diaphragm and anal canal.
Remarkably, the dosimetric hierarchy observed from DVH results is not in agreement with the DSC results. It may have been suspected that the structures that produce the lowest DSC values would cause the greatest variability in the dosimetric parameters. Although bladder had the highest DSC score as well as the least variable DVH results, the remaining two structures do not show correlation between DSC score and DVH impact. The rectum had the lowest DSC score, and yet the prostate contributed the most to DVH variability, despite having a higher DSC score than the rectum. Furthermore, varying only rectum contours caused notable "isolated" variability in DVH of only rectum, whereas varying only prostate contours caused notable generalized variability in DVH for prostate, rectum, and bladder. The true dosimetric impact of contouring variability in this case therefore could not have been predicted by assaying simple similarity measurements alone. This decoupling of dosimetric effect and contouring variability metrics suggests that future strategies of reducing contouring variability should seek to reduce dose variation impact, not just contour compliance. There is no one-to-one correspondence between contour and dose volumes. There is a non-linear interplay between dose optimization, dose deposition, and contour topology.
There are a number of possible reasons for the different contributions of each structure to DVH variability. Firstly, the optimization of treatment plans ranked the prostate voxels as dominant over voxels of either OAR through a weighting assignment. Thus, the prostate volume contributes more heavily to driving the dose optimization algorithm towards a large uniform dose to the prostate. Secondly, the prostate is geographically the central structure, since it is the target. It is in close proximity with two OARs, which may contribute to its high level of contribution to the treatment plan optimization.
Most previous studies of contouring variability have not "followed through" on dosimetric impact and have mostly only provided measurements of similarity, such as DSC. Studies that have demonstrated DVH results [5][6][7] have shown that contouring variability can lead to dosimetrically relevant treatment planning variability, although a previous prostate study [7] suggests that contouring variability in prostate cancer may not have large dosimetric impacts.
Some notable limitations exist in this study. Firstly, as with all studies involving a contouring challenge, the definition of a true gold standard is difficult. In this study, the gold standard was generated using the STAPLE algorithm to achieve a consensus representative contour. This approach reduces the subjectivity of a gold standard contour definition by a panel of experts.
Another limitation was the five-field IMRT treatment planning process that was used. Although we used a consistent set of plan optimization parameters and a class solution that generally produces clinically acceptable plans, no attempt was made to re-optimize individual plans based on the test contours beyond the class solution. Other centres, or other operators, may use different selections and approaches. Thus, the results of this study must be interpreted within the context of a fixed treatment plan protocol.
Future work could include repeating similar experiments for other tumour sites, deriving hierarchies for organ sets in regions other than pelvis where only similarity measurements have been performed to date. Furthermore, the analytic approach used in this study could also be used in the validation of automated contouring techniques.
There is known inter- and intra-fraction variability in organ volume and position during radiotherapy. This contributes to dosimetric variability during actual treatment and is influenced by other observations, such as interpretation of in-room image guidance information used in patient setup or beam gating. Placing the contouring effects noted here into the context of overall uncertainties in prostate cancer planning and delivery will be important in order to identify the weakest links in the planning-delivery chain, where future effort should be concentrated. For example, would perfectly congruent OAR and PTV contouring make a significant dosimetric difference, given the other downstream sources of uncertainty that would still persist?