Abstract
18F-FDG PET/CT quantification of whole-body tumor burden in lymphoma is not routinely performed because of the lack of fast methods. Although the semiautomatic method is fast, it is not fast enough to quantify tumor burden in daily clinical practice. Our purpose was to evaluate the performance of convolutional neural network (CNN) software in localizing neoplastic lesions in whole-body 18F-FDG PET/CT images of pediatric lymphoma patients. Methods: The retrospective image dataset, derived from the data pool of the International Atomic Energy Agency (coordinated research project E12017), included 102 baseline staging 18F-FDG PET/CT studies of pediatric lymphoma patients (mean age, 11 y). The images were quantified to determine the whole-body tumor burden (whole-body metabolic tumor volume [wbMTV] and whole-body total lesion glycolysis [wbTLG]) using semiautomatic software and CNN-based software. The results were displayed as semiautomatic wbMTV and wbTLG and as CNN wbMTV and wbTLG. The intraclass correlation coefficient (ICC) was applied to evaluate concordance between the CNN-based software and the semiautomatic software. Results: Twenty-six patients were excluded from the analysis because the CNN software was unable to perform calculations for them. In the remaining 76 patients, CNN and semiautomatic wbMTV tumor burden metrics correlated strongly (ICC, 0.993; 95% CI, 0.989–0.996; P < 0.0001), as did CNN and semiautomatic wbTLG (ICC, 0.999; 95% CI, 0.998–0.999; P < 0.0001). However, the time spent calculating these metrics was significantly less (P < 0.0001) with CNN (mean, 19 s; range, 11–50 s) than with the semiautomatic method (mean, 21.6 min; range, 3.2–62.1 min), especially in patients with advanced disease. Conclusion: Determining whole-body tumor burden in pediatric lymphoma patients using CNN is fast and feasible in clinical practice.
For pediatric staging and treatment response evaluation of Hodgkin and non-Hodgkin lymphoma, 18F-FDG PET/CT is an invaluable tool and an established modality (1–7). Visual interpretation of 18F-FDG PET/CT studies to assess the extent of disease can be subjective; therefore, quantitative interpretation is necessary to provide additional insight and reduce the subjectivity of visual interpretation (8,9). 18F-FDG PET/CT whole-body metabolic tumor burden parameters such as metabolic tumor volume (MTV) and total lesion glycolysis (TLG) bear a high prognostic value in lymphoma patients, much greater than SUVs (10–13). However, these prognostic parameters, although easily measured in primary solid tumors (14–17), have not been applied in daily clinical practice to patients with widespread lymphoma disease because the calculations are extremely time-consuming.
There is a wide variety of methods to quantify MTV and TLG, either threshold-based or algorithm-based. Among the threshold-based methods, the most commonly applied is the volume-of-interest (VOI) isocontour method (15,17,18). Automatic multifocal segmentation in patients with lymphoma uses VOI isocontours and has been validated and proven to be quite fast (19). However, depending on patient tumor burden, the time spent calculating MTV and TLG can still be impractical for daily clinical practice. The extraction and processing of imaging features from radiologic data, also known as radiomics, may also link imaging features with patient outcome. Radiomics, however, also requires precise delineation of the tumor region of interest, which is likewise time-consuming and subject to variability between observers.
Deep learning with neural networks has evolved substantially, achieving remarkable success in tumor segmentation and diagnosis and helping to transform and optimize clinical practice (18,20–23) by building diagnostic models that provide objective and accurate diagnoses. For example, software for multimodality imaging using deep convolutional neural networks (CNNs) automatically localizes and delineates metastases in whole-body 18F-FDG PET/CT scans. Deep CNN seems capable of correctly localizing and classifying uptake patterns in 18F-FDG PET/CT images into foci suggestive and nonsuggestive of cancer. These extracted features help the semantic interpretation and may simplify the PET workflow with a 1-click calculation of whole-body tumor burden (24–26). However, the clinical applicability of this software has not yet been fully tested, and unusual features may be identified if the software is not supervised by a physician (27,28).
The purpose of this study was to evaluate the performance of the recently developed CNN software in a clinical setting in pediatric lymphoma patients.
MATERIALS AND METHODS
This dataset, retrospectively studied, is derived from a subset of 102 baseline staging 18F-FDG PET/CT studies of pediatric lymphoma patient images from the data pool of the prospective multicenter research project coordinated by the International Atomic Energy Agency (coordinated research project E12017).
Research Regulation and Data Protection
The study protocol was approved by each center’s Institutional Review Board. A signed parental consent form was an inclusion criterion for recruitment, and all subjects gave such consent. Cases and forms were anonymized to ensure confidentiality while sharing data internationally.
Patients
The eligibility criteria consisted of pediatric patients (age < 18 y) with newly diagnosed Hodgkin lymphoma or non-Hodgkin lymphoma who underwent a staging 18F-FDG PET/CT scan. The diagnosis was based on biopsy with immunohistochemistry, according to the World Health Organization classification criteria (29). Exclusion criteria consisted of prior radiation therapy or chemotherapy and concurrent HIV infection.
The patients’ clinical characteristics and tumor stage were evaluated, including the age at diagnosis, the final clinical stage, spleen disease, additional nodal sites, disease volume, B symptoms, lactate dehydrogenase level, leukocytosis, erythrocyte sedimentation rate, anemia, albumin level, bone marrow 18F-FDG uptake, Deauville score, MTV, and TLG.
18F-FDG PET/CT Imaging and Quantification
All patients underwent staging whole-body 18F-FDG PET/CT, from the top of the skull to the toes. All scans were obtained according to standard Society of Nuclear Medicine and Molecular Imaging or European Association of Nuclear Medicine procedure guidelines (30).
The whole-body MTV (wbMTV) and whole-body TLG (wbTLG) metrics were calculated using semiautomatic and CNN software. All images on both types of software were processed by 2 observers. Differences in the wbMTV and wbTLG metrics (if any) were recalculated to reach consensus. The semiautomatic software was used as the reference standard to evaluate the CNN software’s performance.
Semiautomatic Quantification of Whole-Body Tumor Burden
The wbMTV and wbTLG metrics were calculated using semiautomatic multifocal segmentation software (Syngovia VB20; Siemens Medical Solutions), previously validated for clinical use (19) using a fixed threshold.
With this software, the whole-body tumor burden metrics (semiautomatic wbMTV and wbTLG) were obtained by choosing the multifocal segmentation tool, which automatically draws a rectangular VOI around the patient’s entire body in the coronal plane. If necessary, the VOI was adjusted in the axial and sagittal planes. The liver was set as the background reference, and VOIs were then automatically determined around each lymphoma lesion with uptake higher than the SUVmean of the liver. A threshold of 41% of the SUVmax using isocontour drawings was applied for all automatically delineated lesions. The image and VOIs were then reviewed to exclude physiologic areas incorrectly selected as cancer (such as brain, kidneys, bladder, and ureters) and to include metastatic foci with relatively low uptake that were missed by the software (e.g., small lymph nodes). Afterward, whole-body MTV and TLG calculations were readily available and displayed as semiautomatic wbMTV and wbTLG (Fig. 1).
FIGURE 1. Whole-body tumor burden quantification on baseline staging 18F-FDG PET/CT using semiautomatic software on patient with non-Hodgkin lymphoma. (A) Maximum-intensity projection shows hypermetabolic lymphoma infiltration in left supraclavicular and cervical lymph nodes, mediastinal lymph nodes, and extensively in abdominopelvic lymph nodes; lung nodules; and bone infiltration. (B) For calculation, liver is set as background reference, and VOIs automatically surround each lymphoma lesion with uptake higher than SUVmean of liver. VOIs are then reviewed to exclude physiologic areas incorrectly selected as cancer and to include metastatic foci with relatively low uptake, such as lung nodule metastasis with mild 18F-FDG uptake in right upper lobe.
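The arithmetic behind the fixed-threshold step described above can be illustrated with a minimal Python sketch. It is not the vendor's implementation: the function names, the flat list-of-voxels representation of a VOI, the voxel volume, the use of each VOI's SUVmax for the comparison with the liver, and the inclusive cutoff are all assumptions made for illustration. MTV is taken as the volume of the voxels at or above 41% of the VOI's SUVmax, and TLG as MTV multiplied by the SUVmean of those voxels.

```python
from typing import List, Sequence, Tuple


def lesion_mtv_tlg(voi_suv: Sequence[float],
                   voxel_volume_ml: float,
                   fraction: float = 0.41) -> Tuple[float, float]:
    """Return (MTV in mL, TLG) for one VOI using a fixed fraction of SUVmax."""
    cutoff = fraction * max(voi_suv)              # 41% isocontour level
    kept = [v for v in voi_suv if v >= cutoff]    # voxels inside the isocontour
    mtv = len(kept) * voxel_volume_ml             # metabolic tumor volume
    suv_mean = sum(kept) / len(kept)              # mean uptake inside the contour
    return mtv, suv_mean * mtv                    # TLG = SUVmean x MTV


def whole_body_burden(vois: List[Sequence[float]],
                      liver_suv_mean: float,
                      voxel_volume_ml: float) -> Tuple[float, float]:
    """Sum MTV and TLG over VOIs whose peak uptake exceeds the liver SUVmean."""
    wb_mtv = 0.0
    wb_tlg = 0.0
    for voi in vois:
        if max(voi) <= liver_suv_mean:            # at or below background: skip
            continue
        mtv, tlg = lesion_mtv_tlg(voi, voxel_volume_ml)
        wb_mtv += mtv
        wb_tlg += tlg
    return wb_mtv, wb_tlg


# Hypothetical toy data: two candidate VOIs, 0.5 mL voxels, liver SUVmean of 2.0.
example_vois = [[10.0, 8.0, 5.0, 4.0, 2.0], [1.5, 1.0]]
wb_mtv, wb_tlg = whole_body_burden(example_vois, liver_suv_mean=2.0,
                                   voxel_volume_ml=0.5)
print(f"wbMTV = {wb_mtv:.1f} mL, wbTLG = {wb_tlg:.1f}")
```

In the toy data, the second VOI never exceeds the liver background and is ignored, so only the first VOI contributes. The manual review step (removing physiologic uptake, adding missed low-uptake foci) is deliberately not modeled here.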
CNN Quantification of Whole-Body Tumor Burden
The wbMTV and wbTLG metrics were calculated using software based on deep CNN (Syngovia VB50; Siemens Healthineers). With this software, the whole-body tumor burden metrics (CNN wbMTV and wbTLG) were obtained.
Computation of the whole-body tumor burden on the CNN software was performed automatically by the deep CNN method described by Sibille et al. (24). Unlike the semiautomatic software, the CNN software does not require initial positioning of a VOI surrounding the body. The software automatically computes the maximum-intensity-projection 18F-FDG PET image and integrates the anatomic CT image using an intuitive interface. It then automatically detects 18F-FDG–avid anatomic landmarks and identifies hypermetabolic areas related to physiologic activity, which are automatically excluded from the tumor burden calculation (Fig. 2). Briefly, the PET VOIs are segmented using a fixed-threshold algorithm and evaluated by the deep CNN. Whole-body CT examinations are aligned to an anatomic atlas. Finally, a maximum-intensity projection of the whole-body 18F-FDG PET/CT is reconstructed, and the lesions are classified. The deep CNN uses a combination of multiplanar reconstructions of PET and CT, 18F-FDG PET maximum-intensity projections, and anatomic atlases to predict the anatomic localization of 18F-FDG foci and to determine whether each focus is suggestive of malignancy. This specific CNN software is not yet validated for pediatric patients.
FIGURE 2. Whole-body tumor burden quantification on staging 18F-FDG PET/CT using CNN. Displayed in red are regions that software excluded from analysis (regions related to physiologic uptake: brain, head and neck, heart, intestines, kidneys, and bladder), and displayed in green are regions that software included in calculation of whole-body tumor burden. In this patient, extensive cervical lymph node bulky mass and mediastinal lymph nodes were included.
Two forms of analyses were undertaken on the CNN software: the observer method, in which all VOIs automatically generated by the multifocal segmentation tool were reviewed (in a masked manner) by both observers to determine whether the VOIs were wrongly included or excluded from the results (afterward, values were calculated and displayed as CNN + observer wbMTV and wbTLG), and the no-observer method, in which the VOIs automatically obtained were accepted and did not undergo a masked review by each of the observers. The calculations were readily available and displayed as CNN wbMTV and wbTLG.
Statistical Analysis
The sample was characterized by descriptive analysis, performed using frequency tables for categoric variables and measures of position and dispersion for continuous variables (mean, SD, median, minimum and maximum).
The χ2 test or Fisher exact test was used to check associations or compare proportions, and the Mann–Whitney test was used to compare continuous or orderable measurements between the 2 groups. Risk factors associated with the event were identified with univariate and multiple Cox regression analyses. The variable selection process used was stepwise.
To verify the relationship between continuous measurements, the Spearman correlation coefficient, which ranges from −1 to 1, was used.
To assess agreement between the semiautomatic and CNN software, the intraclass correlation coefficient (ICC) was used (values above 0.7 were considered to indicate substantial agreement). The Friedman test and the Wilcoxon test for related samples were used to compare the times. The time was defined as the interval from the moment that the physician began focusing on the task until the moment that the whole-body tumor burden calculation was completed. The level of significance was 0.05.
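As an illustration of how an agreement coefficient of this kind is computed, the following Python sketch implements one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement). The text does not state which ICC form or which statistical package was used, so the choice of ICC(2,1), the function name, and the data layout are assumptions. The three example pairs are the semiautomatic and CNN-based wbMTV values quoted in the figure captions; they only demonstrate the call and do not reproduce the study's ICC.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings[i][j] is the value for subject i measured by method j.
    """
    n = len(ratings)        # number of subjects (patients)
    k = len(ratings[0])     # number of methods (e.g., semiautomatic and CNN)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# (semiautomatic wbMTV, CNN-based wbMTV) pairs quoted in the figure captions.
caption_pairs = [(104, 105), (548, 570), (194, 200)]
print(f"ICC(2,1) for the three caption examples: {icc_2_1(caption_pairs):.3f}")
```

Because ICC(2,1) penalizes systematic offsets between methods as well as random disagreement, it is a stricter measure than a plain correlation coefficient; a consistency-type ICC would be a different design choice.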
RESULTS
The whole-body tumor burden was quantified using both types of software in 102 18F-FDG PET/CT baseline scans of pediatric lymphoma patients. There were 32 (31.4%) girls and 70 (68.6%) boys. The mean age at lymphoma diagnosis was 11.1 ± 4.3 y (range, 4.0–18.0 y). Among these, 80 (78.4%) patients had Hodgkin lymphoma, and 22 (21.6%) had non-Hodgkin lymphoma. Table 1 displays the clinical characteristics.
TABLE 1. Clinical Characteristics of Patients (n = 102)
Semiautomatic Calculation of Whole-Body Tumor Burden
The semiautomatic wbMTV and wbTLG were calculated in all 102 patients. The average time spent on this calculation was 21.6 min, ranging from 3.2 to 62.1 min. Notably, in patients with widespread lesions in multiple organs or confluent with areas of physiologic excretion, the software took longer to identify and delineate abnormal areas.
CNN-Based Calculation of Tumor Burden
The CNN + observer wbMTV and wbTLG were also calculated in all 102 patients. The average time spent on this calculation, with the observers reviewing the CNN-generated VOIs before calculation, was 3.8 min, ranging from 0.5 to 19.6 min.
On the other hand, CNN wbMTV and wbTLG (i.e., without any observer evaluating the CNN software’s performance before calculation) were calculated in 76 of the 102 patients. Twenty-six patients were excluded from the analyses because the software could not perform calculations because of patient movement or misregistration (n = 6), because the software could not recognize small lymph nodes as diseased (n = 8), or because there was widespread brown fat (n = 3), diffuse bone infiltration (n = 5), diffuse homogeneous mild infiltration of the spleen (n = 2), or subcutaneous infiltration of 18F-FDG at the injection site (n = 2) (Fig. 3).
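The exclusion counts above can be tallied with a short Python check. The counts are those reported in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Exclusion counts as reported in the text; keys are shorthand labels.
exclusions = {
    "patient movement or misregistration": 6,
    "small lymph nodes not recognized": 8,
    "widespread brown fat": 3,
    "diffuse bone infiltration": 5,
    "diffuse mild spleen infiltration": 2,
    "18F-FDG infiltration at injection site": 2,
}

total_patients = 102
excluded = sum(exclusions.values())              # patients without CNN-only results
analyzed = total_patients - excluded             # patients with CNN-only results
excluded_pct = 100 * excluded / total_patients   # share of the cohort excluded

print(f"excluded: {excluded}, analyzed: {analyzed}, "
      f"excluded share: {excluded_pct:.1f}%")
```

The six categories account for all 26 excluded patients, leaving 76 for the fully automatic analysis, which is about one quarter of the cohort excluded.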
FIGURE 3. Baseline staging 18F-FDG PET/CT of patient with Hodgkin lymphoma. (A) Maximum-intensity projection reveals cervical hypermetabolic bulky mass. (B) Image displayed with different whole-body tumor burden quantification methods shows that using semiautomatic method, VOIs are delineated in cancer lesions and also in physiologic regions not related to cancer; these regions must be deleted before quantification. Whole-body tumor burden calculation showed semiautomatic wbMTV of 104 and TLG of 1,663; time spent calculating these metrics was 5 min. (C) CNN whole-body tumor burden quantification does not delineate regions unrelated to cancer and demonstrates similar metrics: CNN + observer wbMTV of 105 and CNN + observer wbTLG of 1,671. Time spent calculating was significantly less (13 s) even though CNN software failed to delineate spleen, which had to be performed manually.
Impressively, the average total time spent calculating CNN wbMTV and wbTLG was 19 s, ranging from 11 to 50 s. This total time begins when the physician begins focusing on the task and ends at completion of the whole-body tumor burden calculation. The times spent calculating CNN, CNN + observer, and semiautomatic wbMTV metrics in the 76 paired patients were significantly different (P < 0.0001). The CNN software alone was much faster than either the semiautomatic or the CNN + observer method (Table 2).
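To put the reported mean times on a common scale, the sketch below converts them to seconds and expresses each as a multiple of the CNN-only mean. The means are those quoted in the text; the variable names are illustrative. The semiautomatic and CNN + observer means were reported for all 102 patients and the CNN-only mean for 76 patients, so the ratios are rough indications rather than paired comparisons.

```python
# Mean times quoted in the text, converted to seconds.
mean_time_s = {
    "semiautomatic": 21.6 * 60,    # 21.6 min
    "CNN + observer": 3.8 * 60,    # 3.8 min
    "CNN": 19.0,                   # 19 s
}

# Ratio of each method's mean time to the CNN-only mean time.
ratio_to_cnn = {name: t / mean_time_s["CNN"] for name, t in mean_time_s.items()}

for name, ratio in ratio_to_cnn.items():
    print(f"{name}: {mean_time_s[name]:.0f} s, {ratio:.1f}x the CNN-only mean")
```

On these figures, the semiautomatic method took roughly 68 times as long as the CNN alone on average, and the CNN with observer review roughly 12 times as long.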
TABLE 2. Time Spent Quantifying Whole-Body Tumor Burden Metrics on Semiautomatic Software and CNN Software With and Without Observer Input
Comparison of Semiautomatic and CNN Tumor Burden Measurements
The CNN + observer and semiautomatic wbMTV metrics calculated on the 102 patients correlated strongly (ICC, 0.993; 95% CI, 0.989–0.996; P < 0.0001), as did the CNN + observer and semiautomatic wbTLG metrics (ICC, 0.999; 95% CI, 0.998–0.999; P < 0.0001). Among the 76 18F-FDG PET/CT studies in which the fully automatic CNN was performed, the CNN + observer, CNN, and semiautomatic wbMTV metrics also correlated strongly, as did the CNN + observer, CNN, and semiautomatic wbTLG metrics (Table 3).
TABLE 3. Correlation of Whole-Body Tumor Burden Metrics on Semiautomatic Software and CNN-Based Software With and Without Observer Input in 76 Patients
Impressively, agreement between CNN and semiautomatic wbMTV was high (ICC, 0.950; 95% CI, 0.922–0.968; P < 0.0001), as was agreement between CNN and semiautomatic wbTLG (ICC, 0.947; 95% CI, 0.917–0.966; P < 0.0001). Therefore, the CNN software performed similarly to the semiautomatic tool, in which an experienced observer evaluated the images.
More impressive, however, was the fact that agreement between CNN + observer and CNN wbMTV was high (ICC, 0.946; 95% CI, 0.912–0.966; P < 0.0001), as was agreement between CNN + observer and CNN wbTLG (ICC, 0.952; 95% CI, 0.925–0.969; P < 0.0001). Consequently, in these patients the CNN software did not require an observer to evaluate the images and validate all VOIs.
DISCUSSION
To our knowledge, this was the first study to quantify the whole-body tumor burden of pediatric lymphoma patients using CNN-based deep learning. Despite the difference in 18F-FDG biodistribution between children and adults, the CNN-based software accurately delineated abnormal regions. The CNN-based software optimized the working time, was extremely fast, and performed as well as the semiautomatic software in calculating whole-body tumor burden.
The CNN-based software allows a review of the VOIs provided automatically (i.e., VOIs can be added manually or incorrect ones deleted). Ultimately, comparison of the CNN-based software with and without the observer’s review of the VOIs rendered highly concordant metrics. The time spent determining the whole-body tumor burden metrics with the semiautomatic software, however, was longer because it depends primarily on the extent of the disease. The semiautomatic quantification does not allow preselection of VOIs by the operator before creating the definitive findings and thus does not distinguish diseased areas from physiologic areas, creating many VOIs that overload the program.
On the other hand, quantifying the whole-body tumor burden with CNN-based software was significantly faster, both with and without the observer reviewing the VOIs. Impressively, when quantification on the CNN-based software without observer input was compared with the semiautomatic software and with the CNN-based software with observer input, the CNN-based software without observer input was significantly faster and just as precise. CNN-based software took approximately 20 s to calculate the patient’s entire tumor burden, without the need to review the VOIs (Figs. 4 and 5).
FIGURE 4. Baseline staging 18F-FDG PET/CT of Hodgkin lymphoma. (A) Maximum-intensity projection reveals mediastinal hypermetabolic bulky mass and extensive infiltration of cervical lymph nodes, abdominal lymph nodes, and spleen. (B) Semiautomatic quantification reveals semiautomatic wbMTV of 548 and semiautomatic wbTLG of 5,238; time spent calculating was 15 min. (C) CNN whole-body tumor burden quantification demonstrates similar metrics: CNN wbMTV of 570 and CNN wbTLG of 5,213, but time spent calculating was significantly less (14 s). CNN software excludes focal areas of physiologic uptake such as right ureter and includes areas of mild uptake such as left hilar lymph node.
FIGURE 5. 18F-FDG PET/CT of patient with Hodgkin lymphoma. (A) Maximum-intensity projection reveals mediastinal hypermetabolic bulky mass and cervical, axillary, and inguinal nodes. (B) Semiautomatic VB20 whole-body tumor burden quantification reveals MTV of 194 and TLG of 1,007; time spent calculating these metrics was 30 min because of extent of lesions and need to exclude multiple areas of physiologic uptake. (C) CNN whole-body tumor burden quantification demonstrates similar metrics: MTV of 200 and TLG of 968. However, time spent calculating was significantly less (36 s). CNN software excludes physiologic areas with high uptake such as heart and includes lymph nodes with less uptake adjacent to heart.
However, there were some limitations. It was not possible to show whether the measurements predicted by the CNN-based software could be applied to our patient cohort to predict prognosis and response evaluation. Most (78.4%) of the patients had Hodgkin lymphoma, and there were only 2 deaths; therefore, it was not possible to determine overall survival. A larger number of patients with events is required to determine whether the measurements predicted by CNN-based software can predict prognosis. Another limitation is that 25% of the patients were excluded from analyses because the CNN-based software could not recognize areas of metabolically active disease and could not perform the calculation. These patients had to be excluded because CNN quantification could not be compared with manual or semiautomatic quantification. The CNN software we tested was not initially designed or validated specifically for pediatric patients but, even so, performed quite well. These exclusions were caused by either the wrong lesion being segmented or lesions being missed. For example, small lymph nodes with mild 18F-FDG uptake were excluded; extensive brown fat was erroneously included as lymphomatous infiltration; extensive diffuse bone marrow infiltration (5/12 patients) was missed; and radiopharmaceutical extravasation sites and a bladder catheter were erroneously included. Most likely, with further CNN and deep-learning development and specific training in pediatric patients regarding differentiation of normal biodistribution from cancer tissue, failure rates will decrease.
CNN-based software still requires the input of the observer (26–28). In 25% of the patients, the CNN could not depict the correct neoplastic tissue or added nonneoplastic tissue; thus, these patients had to be excluded because the software could not perform the calculations. Therefore, errors and failure to detect the proper tissue will occur even in CNN and deep-learning software, arguing in favor of observer input. Most likely, the largest errors will be associated with unsupervised quantification.
CONCLUSION
CNN-based quantification of whole-body tumor burden in pediatric lymphoma patients is an emerging field. Determination of whole-body tumor burden using CNN-based software is extremely fast and feasible in clinical practice in pediatric lymphoma patients. The software requires further deep-learning development and specific training in pediatric patients, as well as observer input, to minimize failure rates. Tumor burden should be evaluated in most if not all tumors and age groups for therapy purposes.
DISCLOSURE
The whole-body metrics were calculated using a loaned Siemens device equipped with a software based on deep CNN (Syngovia VB50).
KEY POINTS
QUESTION: Does CNN-based software provide fast and reliable quantification of whole-body metabolic tumor burden on 18F-FDG PET/CT in pediatric lymphoma patients?
PERTINENT FINDINGS: Quantification of whole-body metabolic tumor burden using CNN correlates strongly with semiautomatic quantification (ICC, 0.993; 95% CI, 0.989–0.996; P < 0.0001).
IMPLICATIONS FOR PATIENT CARE: In addition to reliable data, implementation of CNN quantification tools in clinical practice may be able to quickly and accurately deliver prognostic information for better patient management.
ACKNOWLEDGMENT
We thank Cleide Silva from the Statistics Research Department of the University of Campinas for her invaluable help.
Footnotes
Published online Apr. 19, 2022.
REFERENCES
- Received for publication July 14, 2021.
- Accepted for publication February 10, 2022.