How to Make Effective Use of Public Datasets in Oncology: Q&A With Dr. Eduardo Gonzalez-Couto
A recent review put it very succinctly: precision medicine's aim of "the right drug for the right patient at the right time" may be achieved only if "the right data come to the right clinic at the right time". In oncology alone, millions of dollars have been spent collecting vast amounts of data for a large number of cancers through projects such as The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE), Genomics Evidence Neoplasia Information Exchange (GENIE), Genomics of Drug Sensitivity in Cancer (GDSC), and the Gene Expression Omnibus (GEO). Below is a Q&A session with Dr. Eduardo Gonzalez-Couto, Bioinformatics Product Strategist & Manager at PerkinElmer, on what is needed to make these public data more accessible to end-users.
Simone: Eduardo, you have been in the genomics space for quite some time now and have extensive experience of using public databases that collect and curate the large amounts of data being generated. What, in your view, are the main benefits of using these public databases?
Eduardo: To understand the aetiology of rare diseases (including certain cancer types) we need large datasets that give us enough statistical power to test our hypotheses. Massive datasets have been collected and curated by large consortia and made publicly accessible. As such, these public databases hold a wealth of data, and if mined effectively they enable end-users to validate their results and to test new hypotheses for novel diagnostic and therapeutic applications. For example, at the 12th Annual Pharmaceutical IT Congress, Roche presented a project in which they mined data from TCGA and reprocessed it using smart algorithms (including a filter for false positives) to take advantage of the exon expression imbalance observed in nature. Using this genome data mining strategy, Roche scientists were able to identify novel chimeric fusion proteins and launch new internal validation projects. Such activity is only possible when we have access to large datasets.
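Mining TCGA at this scale starts with programmatic access. As a minimal sketch (not Roche's actual pipeline), the snippet below queries the NCI Genomic Data Commons, which hosts TCGA data, through its public REST API for RNA-Seq gene expression files from one cancer study; the project and data-type filter values are illustrative.

```python
# Minimal sketch: query the GDC API (which hosts TCGA) for RNA-Seq
# gene expression quantification files from the TCGA-LUAD project.
import json
import requests

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Nested filter: project AND data category AND data type.
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-LUAD"]}},
        {"op": "in", "content": {"field": "data_category",
                                 "value": ["Transcriptome Profiling"]}},
        {"op": "in", "content": {"field": "data_type",
                                 "value": ["Gene Expression Quantification"]}},
    ],
}

params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,cases.submitter_id",
    "format": "JSON",
    "size": "10",  # keep the example small; raise for real mining
}

response = requests.get(FILES_ENDPOINT, params=params)
response.raise_for_status()

# Each hit describes one downloadable file and the case it belongs to.
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"])
```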
Simone: The use of open-source code has really driven the creation of these public databases. Although they have developed more intuitive user interfaces over time, accessing data from these resources is still fairly cumbersome if you do not have the right bioinformatics skills. Would you say that is the key challenge associated with using these databases?
Eduardo: I would say it is one of the main challenges, but by no means the only one. Public databases focus on certain goals, such as the collection and curation of data (and rightly so). Their priority might not be to make the system highly intuitive for end-users, and that is probably the right strategy for these public consortia. However, data generated and stored in large public databases only become a valuable asset once they can be accessed, searched and effectively mined for analytics, and not just by scientists with informatics skills but also by the domain experts who can use their therapeutic-area knowledge to put these data into a biological context. This is where I believe commercial vendors can bridge the gap, by making large-scale data available to end-users in an efficient manner, whether for validating existing hypotheses or for creating new modelling techniques or algorithms.
Simone: In order to derive the sort of meaningful insights you propose, we first need to enable seamless access to these data. Could you talk about other challenges that hinder the use of these public databases to their full extent?
Eduardo: The primary expectation of these databases is often that you have the bioinformatics skills to extract the necessary data. On top of that, any additional data integration, enrichment and analytics require specific and deep expertise, and it is no easy feat when multiple studies are being combined. This all becomes a massive stumbling block for scientific experts who have the skills to give the data biological meaning, but who rely on others with bioinformatics skills to perform the routine tasks of extracting, integrating and analysing the data. Additionally, bioinformaticians are highly skilled people with great technical expertise whose time is better spent developing new algorithms than doing the more routine work of extracting and integrating data and performing basic analytics. Giving end-users faster access to data and analytics to validate their hypotheses allows them to pick the lower-hanging fruit faster. It also frees up bioinformaticians to devote their time to creating and testing more complex modelling strategies.
Simone: Of course, these situations are not mutually exclusive. Once more complex models have been created, they will also need to be validated in an environment where, ultimately, a biological context can be applied for clinical utility?
Eduardo: Precisely. Even complex methodologies ultimately need to be deployed in a user-friendly system to assess whether there is valid clinical utility, and this needs to be done in a scalable manner. Our main focus should always be to give end-users confidence in the analytic questions they are asking, and we need to address that with a combination of an enhanced visual user interface (UI) and accessible analytic workflows that can be repurposed in a routine manner. Such a strategy is essential to extracting the most out of large datasets, whether they are public or proprietary.
Simone: TCGA is one such large public database that does have a modern UI and some user-friendly analytics to enable easier access to data. Additionally, there are other publicly available tools that use TCGA data for genomic data mining. What do you think are the gaps in these tools that might hinder effective data mining?
Eduardo: There are publicly available tools that help with the visualisation, analysis and interpretation of TCGA data. They have their advantages, but they are limited in the analytics they offer and quite often do not allow for effective cross-study analysis across multiple cancer types. Therefore, as soon as you want to run complex analytics with multiple TCGA cancer datasets, or combine them with other public or proprietary datasets, you can run into difficulties unless you are a highly trained bioinformatician. Furthermore, the practical integration of large datasets across different databases remains a fairly complex problem. For effective data integration we need an information model that is robust enough to underpin the similarities between multiple datasets but flexible enough to treat them as independent datasets. Once you have an underpinning model, datasets can be mapped onto it for cross-study analysis. This step can additionally be automated so that an end-user doesn't need to burden a bioinformatician every time they need to access and integrate datasets. The whole process becomes more efficient, reducing the burden on resources and controlling overhead costs.
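To make the idea of an underpinning information model concrete, here is a minimal, hypothetical sketch: each study's column names are mapped onto a small canonical schema so that the tables can be combined for cross-study analysis while provenance is retained. The schema, study names and mappings are illustrative only, not those of any specific platform.

```python
# Minimal sketch: map heterogeneous study tables onto one shared information
# model so they can be combined for cross-study analysis.
import pandas as pd

# The shared information model: canonical columns every mapped study exposes.
CANONICAL_COLUMNS = ["patient_id", "cancer_type", "gene", "expression"]

# Per-study mappings from source column names to the canonical model.
STUDY_MAPPINGS = {
    "study_a": {"Patient ID": "patient_id", "Tumour": "cancer_type",
                "Gene Symbol": "gene", "FPKM": "expression"},
    "study_b": {"Subject ID": "patient_id", "Cancer Type": "cancer_type",
                "gene_name": "gene", "expr_value": "expression"},
}

def harmonise(df: pd.DataFrame, study: str) -> pd.DataFrame:
    """Rename a study's columns to the canonical model and tag the source study."""
    mapped = df.rename(columns=STUDY_MAPPINGS[study])[CANONICAL_COLUMNS].copy()
    mapped["study"] = study  # keep provenance so studies stay separable
    return mapped

# Two hypothetical study tables combined into one cross-study frame.
study_a = pd.DataFrame({"Patient ID": ["P1"], "Tumour": ["LUAD"],
                        "Gene Symbol": ["EGFR"], "FPKM": [12.3]})
study_b = pd.DataFrame({"Subject ID": ["S9"], "Cancer Type": ["BRCA"],
                        "gene_name": ["EGFR"], "expr_value": [8.7]})

combined = pd.concat([harmonise(study_a, "study_a"),
                      harmonise(study_b, "study_b")], ignore_index=True)
print(combined)
```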
Simone: Surely we can’t have optimal data integration without optimal harmonisation of data. Could you discuss some strategies around ‘semantic normalisation’ when integrating datasets across multiple studies?
Eduardo: Yes, you are right: when integrating datasets within a study we are used to the idea of mathematical data normalisation, but when we want to combine data across multiple studies we also need to apply an additional layer of semantic normalisation so that the data are understood homogeneously. For example, some studies might use the term Subject ID and some might use Patient ID, and we need to harmonise them across the datasets we want to integrate. Furthermore, you can apply ontologies to add an extra edge to the data, and these can be as simple as the example just mentioned or as complex as you want them to be, e.g. using genomic ontologies to annotate the functional consequence of a mutation across genes.
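As a minimal illustration of the semantic normalisation Eduardo describes, the hypothetical sketch below maps study-specific terms onto a small controlled vocabulary before integration; the vocabularies (including the Sequence Ontology-style consequence terms) are illustrative stand-ins for a proper ontology mapping.

```python
# Minimal sketch: value-level semantic normalisation, mapping the different
# terms used by different studies onto a single controlled vocabulary.
import pandas as pd

# Synonyms for the same diagnosis, as they might appear across studies.
DIAGNOSIS_VOCAB = {
    "lung adenocarcinoma": "Lung Adenocarcinoma",
    "luad": "Lung Adenocarcinoma",
    "adenocarcinoma of lung": "Lung Adenocarcinoma",
    "breast invasive carcinoma": "Breast Invasive Carcinoma",
    "brca": "Breast Invasive Carcinoma",
}

# Study-specific mutation-effect labels mapped to ontology-style terms.
CONSEQUENCE_VOCAB = {
    "missense": "missense_variant",
    "non-synonymous snv": "missense_variant",
    "stopgain": "stop_gained",
    "nonsense": "stop_gained",
}

def normalise_terms(series: pd.Series, vocab: dict) -> pd.Series:
    """Map raw free-text terms to canonical terms; leave unmapped values as-is."""
    return series.str.strip().str.lower().map(vocab).fillna(series)

records = pd.DataFrame({
    "diagnosis": ["LUAD", "Adenocarcinoma of lung", "breast invasive carcinoma"],
    "consequence": ["Missense", "stopgain", "non-synonymous SNV"],
})

records["diagnosis"] = normalise_terms(records["diagnosis"], DIAGNOSIS_VOCAB)
records["consequence"] = normalise_terms(records["consequence"], CONSEQUENCE_VOCAB)
print(records)
```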
Simone: One final question: we have discussed challenges related to the access, integration and analytics of large public datasets, but what about the infrastructure?
Eduardo: From an infrastructure point of view, the changing storage and processing needs of combining large datasets such as TCGA with other datasets (public or private) mean that you need a central system that can be scaled up and down. A cloud-based system addresses this very easily, and not only significantly reduces infrastructure costs but also adds agility. Additionally, automation of certain routine procedures can significantly streamline processes and reduce long-term dependencies, but again this needs to be done in a scalable manner. Currently I don't believe there is a public system that allows all of these challenges to be addressed on the same platform in a scalable manner. This is a gap for commercial vendors to bridge. I think a platform that allows for effective searching, access and analytics in a scalable, user-friendly environment is just the first step in making more effective use of these public resources.
Want to know more about 'Leveraging TCGA for Oncology Research'? Then watch this webinar.
Published on LinkedIn on July 24, 2018