Federated Machine Learning Enables the Largest-to-Date Study on Glioblastoma Boundary Detection
In December 2022, researchers at Intel Labs and the Perelman School of Medicine at the University of Pennsylvania led the largest study to date on identifying malignant brain tumors. The project was made possible by federated learning, a distributed machine learning (ML) approach that lets a shared model learn from medical data held around the globe without compromising patient privacy. This effort, the largest global federated learning study so far, is based on an unprecedented dataset of 6,314 glioblastoma (GBM) patients at 71 sites across 6 continents, and demonstrates a 33% improvement in brain tumor delineation compared to a model trained on publicly available datasets.
In this article, let’s look at what this research means for patient prognosis and at the benefits of federated learning!
Current Challenges to Glioblastoma Delineation
What is Glioblastoma?
Glioblastoma (GBM) is a fast-growing and aggressive brain tumor, known as the most common and most fatal form of brain cancer. A patient diagnosed with GBM has a survival time of just 3–14 months after standard treatment. Although the disease has been studied extensively and treatment options have expanded significantly over the last 20 years, overall survival rates have not improved.
Limitations of Current Glioblastoma Treatment
The struggle to improve treatment and quality of life for GBM patients reflects two major obstacles. First, the tumors’ intrinsic heterogeneity makes it hard to detect the affected region and identify its boundaries. Second, the lack of improvement in treatment outcomes highlights an urgent need to analyze larger and more diverse data in order to understand GBM better. This is especially important as modern healthcare begins shifting from reactive to proactive scanning for early detection of tumors.
Currently, hospitals and medical institutions face a challenge: the number of skilled radiologists cannot keep up with the number of medical images generated. AI models have proven effective at automating scan analysis, yet accuracy is still a primary concern. One solution might be to use larger training datasets to boost accuracy. However, most hospitals hold back from sharing data because of privacy and security concerns.
Federated Learning to the Rescue
What is Federated Learning?
Federated learning, an approach first developed by Google for keyboards’ autocorrect functionality, trains an algorithm across multiple decentralized devices or servers that each hold local data samples, without exchanging the data itself.
In a medical context, a shared model can be distributed to hospitals across the world, where each hospital trains it further on its own patients’ brain scans. Each hospital then sends its updated model to a central server, which aggregates the local models into a global model that has gained knowledge from every participating hospital. The system resolves data privacy concerns by keeping raw data within the data owner’s computing infrastructure: only the updates computed from that data are sent to the central aggregator, not the data itself.
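The aggregation step can be sketched in a few lines. The article does not say which aggregation rule the study used, so this minimal example assumes plain weighted averaging, where each site's parameters count in proportion to its number of cases. The function name `federated_average`, the dictionary layout, and the two example hospitals are all hypothetical.

```python
from typing import Dict, List

# A model is represented here as: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def federated_average(site_weights: List[Weights], site_sizes: List[int]) -> Weights:
    """Combine local models into a global model.

    Assumption: weighted averaging by case count. The source article
    does not state the aggregation rule actually used in the study.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need exactly one case count per site model")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total number of cases must be positive")

    global_weights: Weights = {}
    for name, first_values in site_weights[0].items():
        global_weights[name] = [
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(len(first_values))
        ]
    return global_weights


# Hypothetical updates from two hospitals. Only these numbers leave
# each site; the brain scans themselves stay local.
hospital_a = {"layer1": [1.0, 2.0]}
hospital_b = {"layer1": [3.0, 6.0]}

global_model = federated_average([hospital_a, hospital_b], site_sizes=[100, 300])
print(global_model)  # {'layer1': [2.5, 5.0]}
```

Weighting by case count means a hospital with 300 cases pulls the global model three times as hard as one with 100. Other aggregation rules exist and may suit highly imbalanced sites better.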
Model Training with Federated Learning
The joint research between Penn Medicine and Intel Corporation followed a staged approach.
- The first stage, “Public Initial Model”: The AI model was pre-trained on a public dataset comprising 231 cases from 16 sites. In this stage, the model learned to identify the boundaries of three GBM sub-compartments: enhancing tumor, tumor core, and whole tumor.
- The second stage, “Preliminary Consensus Model”: Starting from the public initial model, training incorporated 2,471 patient cases from 35 other sites to improve accuracy.
- The final stage, “Final Consensus Model”: Starting from the second-stage model, training incorporated the largest amount of data, 6,314 cases at 71 sites on 6 continents, to optimize the model and evaluate how well it generalizes to unseen data.
At each stage, scientists separated 20% of the total cases provided by each site from the model training process and used them as “local validation data”. To test for the model’s ability to generalize, 6 sites with 590 cases were completely excluded from all training stages to represent an unseen out-of-sample population.
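The data partitioning described above can be illustrated with a short sketch: each participating site keeps back 20% of its own cases for local validation, and some sites are excluded entirely to act as the out-of-sample population. The function `split_sites`, the site names, and the case IDs are hypothetical, and the random shuffle is an assumption, since the article does not say how the 20% was chosen.

```python
import random
from typing import Dict, List, Set, Tuple

CasesBySite = Dict[str, List[str]]


def split_sites(
    cases_by_site: CasesBySite,
    held_out_sites: Set[str],
    validation_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[CasesBySite, CasesBySite, CasesBySite]:
    """Return (training, local validation, out-of-sample) cases per site."""
    rng = random.Random(seed)
    train: CasesBySite = {}
    validation: CasesBySite = {}
    out_of_sample: CasesBySite = {}

    for site, cases in cases_by_site.items():
        if site in held_out_sites:
            # Held-out sites never contribute to training at any stage.
            out_of_sample[site] = list(cases)
            continue
        shuffled = list(cases)
        rng.shuffle(shuffled)  # assumption: random selection within the site
        n_val = round(len(shuffled) * validation_fraction)
        validation[site] = shuffled[:n_val]
        train[site] = shuffled[n_val:]
    return train, validation, out_of_sample


# Hypothetical sites and case IDs.
cases_by_site = {
    "site_a": [f"a{i}" for i in range(10)],
    "site_b": [f"b{i}" for i in range(20)],
    "site_c": [f"c{i}" for i in range(5)],
}
train, validation, out_of_sample = split_sites(cases_by_site, held_out_sites={"site_c"})
print(len(train["site_a"]), len(validation["site_a"]))  # 8 2
```

Splitting within each site, rather than across the pooled data, keeps the validation set representative of every institution while still never moving cases between sites.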
Research Outcomes and Implications for Glioblastoma Treatment
At the end of the final stage, the final consensus model showed significant gains over the public initial model when evaluated on the collaborators’ local validation data. In particular, it improved boundary detection by 27% for the enhancing tumor, 33% for the tumor core, and 16% for the whole tumor. These results indicate the benefits of access to larger and more diverse sets of patient cases for refining and validating a model. In general, the more representative data a machine learning model is trained on, the more accurate it tends to become, which in turn strengthens doctors’ understanding of the nature and treatment of even rare diseases, including glioblastoma.
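To make such percentages concrete, here is a small worked example. The article does not state which metric was used to score boundary detection or whether the improvements are relative, so this sketch assumes two things: the Dice overlap score, a common measure for segmentation quality, and improvement expressed relative to the baseline. The function names and the toy voxel sets are hypothetical and are not data from the study.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_score(predicted: Set[Voxel], reference: Set[Voxel]) -> float:
    """Dice overlap: 2 * |A and B| / (|A| + |B|), from 0 (none) to 1 (perfect)."""
    if not predicted and not reference:
        return 1.0  # both empty: treat as perfect agreement
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


def percent_improvement(baseline: float, improved: float) -> float:
    """Improvement of `improved` over `baseline`, relative to the baseline."""
    return (improved - baseline) / baseline * 100.0


# Toy example: voxels an expert marked as tumor, and two model predictions.
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
baseline_prediction = {(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)}
improved_prediction = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}

baseline_dice = dice_score(baseline_prediction, reference)  # 0.5
improved_dice = dice_score(improved_prediction, reference)  # 0.75
print(percent_improvement(baseline_dice, improved_dice))    # 50.0
```

Under these assumptions, a model that goes from half overlap to three-quarters overlap with the expert annotation has improved by 50%, even though its absolute score rose by only 0.25.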
Overall, to make progress in treating diseases, scientists need access to vast quantities of medical data, and a dataset of that size can rarely be generated by a single facility alone. The collaboration between Penn Medicine and Intel Labs on glioblastoma brain scan analysis has shown that large-scale federated learning can make effective use of data from multiple sources, unlocking potential benefits for the healthcare industry. By breaking down data silos across multiple sites, this approach could lead to earlier detection of diseases, potentially improving patients’ quality of life and extending their lifespan.
Thanks for reading!
Open source project: https://github.com/vinbigdata-medical/vindr-lab