# EP 04 | How novel methodologies and analytics are powering integrated evidence

In prior ResearchX episodes, we focused on the vision for integrated evidence, what it actually is and how it’s captured. This episode will expand on the notion of integrated evidence by showcasing innovative methods and actions that can be used to produce evidence that is more than the sum of its parts.

## Transcript

**Narrator:** Previously on ResearchX.

**Alex Deyle:** Current processes are incredibly burdensome on sites.

**Len Rosenberg:** Picture doing 5, 6, 7, 8, 10 studies simultaneously, it's not going to work using the old methods.

**Nelson Lee:** This manual EDC data entry practice has become inefficient, and perhaps it's not even sustainable in the near future.

**Lauren Sutton:** In response to those conversations, we built and deployed research specific workflows directly in the electronic health record.

**Len Rosenberg:** When you go ahead and do precision matching of clinical trials using this master trial approach, you actually extend survival compared to standard care, more than fourfold.

**Nelson Lee:** EHR to EDC is going to pave the way for new data collection strategies in clinical trials.

**Alex Deyle:** It truly feels like we're approaching a tipping point.

**Olivier Humblet:** Hello everybody and welcome to ResearchX. My name is Olivier Humblet and I'm a Senior Quantitative Scientist here at Flatiron Health. I'm very excited about this ResearchX episode today. It's wonderful to see attendees from across the industry joining us, from biopharma and academia to policy groups and more. This is the fourth episode of our ResearchX 2022 season, where we're exploring how integrated evidence can transform oncology research and patient care.

So let's start by briefly reviewing this concept of integrated evidence. We think of integrated evidence essentially as evidence that's more than the sum of its parts. In previous episodes, we talked about the vision and potential impact of integrated evidence, and we introduced a framework for creating it in the form of three phases: generate, combine and analyze, with an emphasis on the generate and combine stages. Today, we'll extend into analyze, showcasing some of the innovative methods that are being used to derive new insights from integrated evidence.

I'm very excited to introduce our speakers who will help bring this all to life. First, Daniel Backenroth, Scientific Director from Janssen, will discuss how to pool datasets from multiple sources and dive into some of the statistical nuances of quantifying heterogeneity when this is done for multiple real-world comparator cohorts. Next, David Paulucci, Associate Director of Data Science from BMS, Sanhita Sengupta, Senior Manager of Data Science from BMS, and Katherine Tan, Senior Quantitative Scientist from Flatiron, will discuss how RWD can be integrated with clinical trial data using hybrid control arm designs. And finally, Jeff Leek, Professor and Director of the Johns Hopkins Data Science Lab, will present on using post-prediction inference in the analysis of variables generated using machine learning. While we have a packed agenda, we will have time at the end for your questions to the speakers.

So before we get started, a few quick housekeeping items. First, I'd like to draw your attention to the Q&A option available throughout the webinar at the bottom of your screen. Feel free to submit a question anytime, or to reach out to us afterwards if you'd like to discuss any of today's content in more detail. If you have any technical questions or issues, please let us know via the same Q&A tool, and we'll do our best to help. Although this hardly needs saying nowadays, please excuse any interruptions from our pets or loved ones. Like many of you, many of us are still working from home.

So one very last thing before we get started. We'd like to learn a bit more about all of you. You'll see a poll pop up on your screen momentarily. Other attendees will not be able to see which responses you choose. Our poll question is, which types of methodological use cases does your organization use? You can select all that apply. The answer choices are, A - pooling RWD to increase power or generalizability, B - hybrid control methodologies to integrate trial and RWD, and C - analysis of machine learning derived variables. Let's close the poll and share the results. Fantastic. Thanks everyone for providing your input. It's really helpful to get a pulse check on how your organizations are using innovative methods and I'm excited for you all to hear the case studies we have on deck today, which will showcase how these are being used in practice.

Now, let's bring in Daniel, who will discuss one way of generating integrated evidence, pooling real-world data from multiple sources, with a specific focus on using this pooled RWD for real-world comparator cohorts. Over to you, Daniel.

**Daniel Backenroth:** Thanks Olivier. It's a pleasure to be here and thank you so much for the introduction. It's been a pleasure to work with you and Trevor on this presentation. I'm excited to continue working with you all. Can I get the first slide please?

So why do we pool? As we know, RWE is often used for very rare subpopulations. So in recent work at Janssen, we were looking at patients with EGFR exon 20 insertions. They're a rare subpopulation of a rare subpopulation. So extremely rare. In that case, it's difficult to get enough real-world patients from one source to provide a robust comparator, for example for an external control arm, so what we want to do is pool across as many RWD sources as we can find that meet the quality thresholds that we have set. Here I've illustrated that with an example. This is actually a BMS trial and, as you can see, they pooled from many, many, many data sources, and they only got 190 participants overall. So, again, once you're applying all the inclusion criteria that you have in these comparative studies, it really becomes very important to pool.

So how do we go about pooling? So as I mentioned, the first thing and the most important thing is to make sure that the data sets that we're using meet the quality thresholds and are fit for use. We need to make sure that the data sets are similar enough that the values are harmonizable. So an example is lines of therapy. For example, when we at Janssen looked into this, we had three vendors and they all constructed lines of therapy in their own specific way. What we needed to do is make sure that we were able to get the raw data from those vendors that would enable us to make consistent lines of therapy across all the data sets so that we could compare those participants to the participants in the single arm trial that we had. So we've selected the datasets. We know we have the raw data that we need to harmonize. Then we need to go about harmonizing, which is an extremely time-intensive task.

Then the third step. We've used a priori principles: we've talked to the vendors, we've looked at the processes, we think we can pool these, we think it's reasonable to pool across these data sets. We can also look at the data itself to tell us, is the data consistent with pooling? So that's step three here and that's what I'm going to focus on in this presentation.

Then the fourth step is to pool the data sources and analyze. So as I mentioned, the problem statement is how do we assess heterogeneity across these novel data sets. In this case, the application is an external control arm. So we have multiple real-world control arms with a single-arm trial. SAT is a single-arm trial. RWCC is a real-world comparator cohort. We want to evaluate the consistency of these comparator cohorts. CC is comparator cohort.

So the key assumption we have that motivates using an external control arm is that after adjustment for confounders, if we compare the single-arm trial to the comparative cohort, we'll get essentially the same estimate as we would've gotten from a randomized controlled trial, up to sampling variability, of course. So what that implies is if you're doing multiple such comparisons against distinct comparative cohorts, there shouldn't be heterogeneity among those results.

So what we'll do in this presentation is talk about ways to assess that heterogeneity or homogeneity. We need to be careful in this case. It could be that we find homogeneity, meaning all the comparisons look fairly consistent, but that doesn't prove that the comparative cohort assumption is correct, meaning it doesn't prove that we have an unbiased estimate of the treatment effect, because all the comparisons could be biased in the same way. So for example, in the Janssen study that I mentioned, all of the comparative cohorts were from the U.S., they only used patients treated in the U.S., whereas the trial had patients treated in Asia and other places. So in no dataset could we control for that potential bias. So if we see consistency among all the comparisons, it could mean that they're all biased.

Nevertheless, presenting evidence with homogeneity can raise confidence in the analysis. Even though as I mentioned, it doesn't prove that there's no bias.

Where can heterogeneity come from? There's two main sources that I think of. One is unmeasured confounding. The populations that we're pooling from these different datasets could be different and we might not have information in those datasets to account for the confounding. There could be different baseline characteristics across the datasets. There could be different supportive care after the baseline in the datasets. There could be different treatments received. This is when clinical input becomes really important to be able to decide, are these datasets really poolable? If we're pooling data from the US and Asia, is the standard of care similar enough in those two places that it makes sense to pool the data? Then another source of heterogeneity is the data quality could be different. Or the data quality could be good in both datasets, but it could be that information is collected in a different way, stored in a different way, and definitions could be different. An example of this could be if deaths or progression events are missing from one dataset more than from another. That would make it potentially not justifiable to pool the datasets.

What I'll talk about next are two methods that we consider to evaluate consistency. One is what we call the aggregate method. So what we do here is we take each comparative cohort and compare against the single arm trial. So we get an effect estimate for each comparison, single arm trial to the comparative cohort, and then we can compare those effect estimates to each other. That's like a classic meta-analysis type of analysis. Then we can also consider an individual patient data method. What we do here is we ignore the single arm trial. We just compare the real-world comparative cohorts to each other, potentially after matching or weighting to the single arm trial. What I'd like to focus on here is the aggregate method.

One benefit of the aggregate method is that we can evaluate heterogeneity even if the sponsor lacks access to all the real-world datasets that are used. This is actually quite a common situation where a disease registry is used to carry out a comparison to a single arm trial. Either those datasets live in jurisdictions with strict data protection rules, or they're owned by organizations that are subject to limitations on data sharing. I'll show how we can use the aggregate method even when we lack access to all the real-world datasets at the same time.

For the aggregate method, the standard approach is Cochran's Q test. This is a method for testing the null hypothesis of homogeneity in meta-analysis. We use a weighted sum of squared deviations around the weighted mean. This quantity, which is found in the middle of the slide, is a sum of weighted squared deviations. The weights are those Ws, and each deviation is Xi minus XW bar. XW bar is a weighted mean, so the statistic adds up across all the datasets how different each dataset is from the mean. There is an assumption embedded in Cochran's Q test. It assumes that the effect estimates are independent. In the classic meta-analysis, we have lots of studies carried out by different research groups and we want to combine them into one estimate.

In this setting, the estimates are actually dependent. They're not independent because every comparative cohort is being compared to the same single arm trial. One half of the comparison for each of the estimates that we're combining is fixed. It's the same across all of them. We need to adjust for that. What we propose is a simple adjustment of the Q test. What we do is we calculate the estimated covariance matrix of the vector of statistics from each comparison. Then we just transform the statistics to be independent with an identity covariance matrix, and the standard Q test can be used. How can we calculate this covariance matrix? What we do and what we suggest is to use the bootstrap.
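As a rough illustration of the two statistics discussed here, the sketch below computes the standard Cochran's Q from per-cohort estimates and variances, and a covariance-aware version. The generalized least-squares form in `adjusted_q` is one reading of "transform the statistics to have an identity covariance matrix" and is an assumption about the details, not the speakers' code. The function names, the small linear solver, and the example numbers are all hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete-gamma series."""
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-15 * total:
        term *= half / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def cochran_q(estimates, variances):
    """Standard Q: weighted squared deviations around the weighted mean."""
    weights = [1.0 / v for v in variances]
    mean_w = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    q = sum(w * (x - mean_w) ** 2 for w, x in zip(weights, estimates))
    return q, chi2_sf(q, len(estimates) - 1)

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def adjusted_q(estimates, cov):
    """Assumed adjustment: Q computed with the full covariance matrix."""
    k = len(estimates)
    u = solve(cov, [1.0] * k)  # inverse covariance times a vector of ones
    mean_gls = sum(ui * x for ui, x in zip(u, estimates)) / sum(u)
    resid = [x - mean_gls for x in estimates]
    q = sum(r * s for r, s in zip(resid, solve(cov, resid)))
    return q, chi2_sf(q, k - 1)

# Hypothetical effect estimates from two comparisons that share one trial arm.
estimates = [0.10, 0.30]
cov = [[0.0050, 0.0025],
       [0.0025, 0.0050]]  # positive off-diagonal: the shared trial arm
q_naive, p_naive = cochran_q(estimates, [cov[0][0], cov[1][1]])
q_adj, p_adj = adjusted_q(estimates, cov)
print(f"standard Q = {q_naive:.2f}, p = {p_naive:.4f}")
print(f"adjusted Q = {q_adj:.2f}, p = {p_adj:.4f}")
```

With a diagonal covariance matrix the two functions agree. With a positive covariance between the estimates, the standard Q overstates the variance of their difference and is therefore smaller than the adjusted statistic.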

Now, what we can do is resample from the single arm trial and the comparative cohorts and calculate estimates. For each comparison, we'll get a thousand estimates, and then we can take those estimates and compute the covariance. And this is the step that makes it possible to carry out this test of heterogeneity even when the sponsor doesn't have access to all the datasets. Because as long as the bootstrap samples for the single arm trial are identical, which can be assured using a shared random seed, we can calculate the covariance of estimates. So we know that the single arm trial bootstrap samples are the same in each of those comparisons against the comparative cohorts.
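The shared-seed idea can be sketched as follows. This is a minimal illustration under stated assumptions: the effect measure is a difference in response rates, the data are made-up 0/1 responses, and `bootstrap_estimates` and `covariance` are hypothetical helper names. The key point is that the trial resamples depend only on `trial_seed`, so two data holders who never see each other's cohort still draw identical trial bootstrap samples.

```python
import random

def bootstrap_estimates(trial, cohort, n_boot, trial_seed, cohort_seed):
    """Bootstrap the difference in response rate (trial minus cohort)."""
    trial_rng = random.Random(trial_seed)    # shared across data holders
    cohort_rng = random.Random(cohort_seed)  # private to each data holder
    estimates = []
    for _ in range(n_boot):
        t = [trial_rng.choice(trial) for _ in trial]
        c = [cohort_rng.choice(cohort) for _ in cohort]
        estimates.append(sum(t) / len(t) - sum(c) / len(c))
    return estimates

def covariance(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

# Made-up binary responses (1 = responder); not data from the talk.
trial = [1] * 50 + [0] * 50
cohort_a = [1] * 40 + [0] * 60   # held by data owner A
cohort_b = [1] * 30 + [0] * 70   # held by data owner B

SHARED_TRIAL_SEED = 2022
est_a = bootstrap_estimates(trial, cohort_a, 1000, SHARED_TRIAL_SEED, cohort_seed=1)
est_b = bootstrap_estimates(trial, cohort_b, 1000, SHARED_TRIAL_SEED, cohort_seed=2)

# Only the vectors of bootstrap estimates need to be shared to build this matrix.
cov_matrix = [[covariance(est_a, est_a), covariance(est_a, est_b)],
              [covariance(est_b, est_a), covariance(est_b, est_b)]]
print(cov_matrix)
```

Because replicate number i uses the same resampled trial in both comparisons, the off-diagonal entry picks up the dependence induced by the common trial arm.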

We did a quick simulation to make sure that this was justifiable. We assumed a hundred participants in the single arm trial and in each of the real-world comparative cohorts. We assumed a 50% response rate in the single arm trial and in one of the comparative cohorts, and then we varied the response rate in the second comparative cohort. This difference could be due to unmeasured confounding, some kind of measurement error, or some other variable that's causing bias. What we do is compare the Q test, the adjusted Q test, and the IPD method (in this case, logistic regression) on the probability of rejecting the null hypothesis.

Here we have the simulation results. We have three different methods: the Q test, the adjusted Q, which is our simple modification to the Q test, and the IPD method. We see that the adjusted Q and the IPD method have essentially the same probability of rejection. The Q test, which assumes independence and so doesn't account for the lack of independence, has neither the right type I error rate nor enough power. One thing I'd like to note is that the power is fairly low, even for a quite meaningful response difference of 20%. The power is only about 80% for the adjusted Q and IPD methods.
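A small Monte Carlo sketch can reproduce the qualitative pattern described here for two comparative cohorts. It is not the speakers' simulation: it assumes 100 participants per arm, uses the difference in response rates as the effect measure, plugs in binomial variances instead of bootstrapping, and omits the IPD (logistic regression) method. With two cohorts, both Q statistics reduce to the squared difference of the two estimates divided by a variance, and only the adjusted version subtracts the covariance from the shared trial arm.

```python
import random

CRIT_1DF = 3.841458820694124  # 95th percentile of chi-square with 1 df

def response_rate(rng, p, n):
    """Observed response rate among n simulated participants."""
    return sum(rng.random() < p for _ in range(n)) / n

def rejection_rates(p_trial, p_a, p_b, n=100, reps=2000, seed=7):
    """Rejection rates of the standard and covariance-adjusted Q tests."""
    rng = random.Random(seed)
    naive = adjusted = 0
    for _ in range(reps):
        t = response_rate(rng, p_trial, n)
        a = response_rate(rng, p_a, n)
        b = response_rate(rng, p_b, n)
        shared = t * (1 - t) / n           # variance from the common trial arm
        var_a = shared + a * (1 - a) / n   # variance of (t - a)
        var_b = shared + b * (1 - b) / n   # variance of (t - b)
        diff_sq = ((t - a) - (t - b)) ** 2
        naive_den = var_a + var_b                # treats estimates as independent
        adj_den = var_a + var_b - 2 * shared     # accounts for their covariance
        if naive_den > 0 and diff_sq / naive_den > CRIT_1DF:
            naive += 1
        if adj_den > 0 and diff_sq / adj_den > CRIT_1DF:
            adjusted += 1
    return naive / reps, adjusted / reps

null_naive, null_adj = rejection_rates(0.5, 0.5, 0.5)   # no heterogeneity
alt_naive, alt_adj = rejection_rates(0.5, 0.5, 0.3)     # 20-point difference
print(f"type I error: standard Q {null_naive:.3f}, adjusted Q {null_adj:.3f}")
print(f"power:        standard Q {alt_naive:.3f}, adjusted Q {alt_adj:.3f}")
```

In this setup the standard Q rejects too rarely under the null and has lower power under the alternative, because ignoring the positive covariance inflates the denominator.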

To summarize, pooling real-world data can improve power and generalizability. As I mentioned, rigorous dataset selection and integration are the fundamentals here. We think of this more as a descriptive statistical test to raise confidence in an analysis that uses multiple real-world comparative cohorts. We don't really suggest a two-stage approach where the sponsor carries out the test of homogeneity and then decides what to do based on that test, because that test is fairly underpowered. The operating characteristics of that procedure won't be that favorable. As I mentioned, the standard Q test is inappropriate in this context, so we've proposed an adjustment to the Q test that can be used instead. One favorable characteristic is that, by using the bootstrap, it can be used even if the sponsor cannot access all comparative cohorts. I'd just like to finish by noting that this is joint work between Flatiron and Janssen, with my collaborators Olivier and Trevor. We've been working together for a while now and have a good collaboration. The sponsors are Meghan Samant from Flatiron and Jose Pinheiro from Janssen. We hope to publicize these results soon.

**Olivier Humblet:** Fantastic. Thank you so much, Daniel. It's fascinating how many considerations there are for pooling RWD and specifically, as you mentioned, around heterogeneity testing. With that, I'd like now to introduce David, Sanhita and Katherine to discuss how RWD can be integrated with clinical trial data using hybrid control arm designs. Over to you, Katherine.

**Katherine Tan:** Thank you, Olivier, for the introduction. I am delighted to start this presentation on hybrid control designs. For those of you with us during Episode 1, Somnath introduced this concept, focusing on the simulation work that Flatiron had conducted. Today, we'll be looking at the application of hybrid control designs using real-world data. In drug development, randomized clinical trials remain the gold standard for evaluating risk benefits of new drugs and regulatory decision making. However, as many of you know, they're expensive, take a long time and could face enrollment issues. What if there is an external data source of patients who already received the standard of care? This data source may then be used to supplement a randomized clinical trial. This is precisely the premise of hybrid control designs, which borrow information from external data and can result in more efficient trials through shorter enrollment and faster trial timelines.

This is an example of a hybrid control design. Instead of the typical two arms in a traditional RCT, the experimental and the control, there is a third arm: an external control coming from the real world. This external control could be accrued concurrently with the RCT or, if the standard of care has not changed in the disease space, accrual may happen prior to the first patient in the RCT. In this illustration, the cohort is accrued for all three arms at the same time. This fixed cohort is then followed up to observe outcomes such as overall survival. One unique feature of hybrid control designs is the step to assess borrowing. This is under the premise that not all external data sources can and should be borrowed as is into the RCT. So for example, if the external cohort is less similar compared to the trial cohort, then you might imagine the external cohort should be appropriately down-weighted to reflect that.

The combined dataset of hybrid control designs would include all experimental patients and a hybrid control consisting of both trial and external patients with appropriate down weighting. On this dataset, the key estimand of interest, which is the treatment effect, comparing the experiment to the control arms can be estimated. Ideally, using a hybrid control design will result in similar estimates of treatment effect compared to an RCT, but because external information can be incorporated, the trial will be done sooner than using a standard RCT.

The crucial step in hybrid control arms is to assess the amount of borrowing, and combining information makes use of statistical borrowing methods. There is a vast literature that includes frequentist and Bayesian approaches. A simple method is called test-and-pool. It tests the RCT and RWD control arms for similarity and borrows either the entire data source or none at all. More sophisticated methods down-weight the external data source appropriately instead of all or nothing, most of which are Bayesian methods such as the power prior or the commensurate prior model. The method used in the case study we will be discussing today was developed by Flatiron, and we call it two-step regression. It is a frequentist analog to the modified power prior model that uses the same down-weighting idea, but is a lot simpler to implement in practice using existing software. As the name suggests, you fit two regression models. The first regression model compares the trial controls to the external controls with the goal of estimating the amount of down-weighting. The second regression compares the experimental arm to the hybrid control with the goal of estimating the treatment effect. I would like to also point out that in both regression models, measured covariates can also be included through weighting.

The statistical borrowing methods on the previous slide have mostly been illustrated on historical trial data. One of the advantages of using RWD is the ability to use controls that are fully concurrent to the RCT. This slide here shows that in practice, hybrid control designs are not a standalone method but part of a larger project life cycle that includes considerations for using real-world data. It is very important to use external data that is fit-for-purpose, with critical data elements captured at sufficient granularity and any limitations appropriately caveated. This then allows the construction of appropriate RWD analytic cohorts through applying inclusion and exclusion criteria, as well as adjusting for measured confounders.

Finally, the hybrid control methodology can be deployed. The amount of borrowing from RWD can be assessed using metrics of similarity: a pre-specified fixed amount, an amount based on covariate similarity, or an amount that also incorporates early outcomes. Finally, the analysis using hybrid controls can then be conducted at a projected time that meets the target number of events for the RCT. Now, let me hand it over to Sanhita to discuss a BMS Flatiron collaboration project.
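The borrowing ideas described in this segment can be sketched with a deliberately simplified example. It uses a binary event rate as a stand-in for overall survival, and it shows only the contrast between all-or-nothing borrowing (test-and-pool) and a fractional down-weight. It is not the two-step regression method itself, and every function name and number below is hypothetical.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided z-test p-value for equal event rates in two groups."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(events_a / n_a - events_b / n_b) / se
    return math.erfc(z / math.sqrt(2))

def pool_or_discard_weight(trial_ctrl, external, alpha=0.05):
    """Test-and-pool: weight 1 if the control arms look similar, else 0."""
    p_value = two_proportion_p_value(*trial_ctrl, *external)
    return 1.0 if p_value >= alpha else 0.0

def hybrid_control_rate(trial_ctrl, external, weight):
    """Event rate and effective size of a down-weighted hybrid control."""
    (events_t, n_t), (events_x, n_x) = trial_ctrl, external
    effective_n = n_t + weight * n_x
    rate = (events_t + weight * events_x) / effective_n
    return rate, effective_n

# Hypothetical (events, patients) counts for each control source.
trial_control = (30, 100)
external_control = (70, 200)

w_all_or_nothing = pool_or_discard_weight(trial_control, external_control)
rate_pooled, n_pooled = hybrid_control_rate(trial_control, external_control, w_all_or_nothing)
rate_half, n_half = hybrid_control_rate(trial_control, external_control, 0.5)
print(w_all_or_nothing, rate_pooled, n_pooled)
print(rate_half, n_half)
```

A weight of 0 recovers the trial control arm alone and a weight of 1 pools the external patients fully. Down-weighting methods choose a value in between based on how similar the two sources appear.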

**Sanhita Sengupta:** Thank you, Katherine. Moving forward, we'll be discussing the collaborative effort between BMS and Flatiron Health where we implemented the two-step regression design of hybrid control. This was basically a retrospective study where we used BMS’ previously completed clinical trial. We also used an external data source, which was the Flatiron Health research database. We had two main research objectives. One was assessing the feasibility of the real-world data cohort, which we derived from Flatiron. The next step was evaluation of the bias in the emulated hybrid control design. In the first step, which was assessment of the real-world data, we assessed both the baseline covariates and the outcome, which is overall survival in this case. The assessment leads to determining how much we can borrow, which is the extent of the borrowing or the feasibility of this real-world cohort.

Then the next step is evaluation when we combine or integrate it with the previously completed trial. And then we calculated various metrics, the main estimate being the treatment effect and another being the impact on study duration, since the main aim here is to reduce the study duration by conducting a hybrid trial. Next slide, please. This is the flow chart of the algorithm we used for the retrospective analysis. The first step, step zero here, is to filter the Flatiron Health research database using the inclusion-exclusion criteria of the previously completed RCT of BMS. After we derived the real-world cohort, steps one and two assess the feasibility. We adjusted for baseline characteristics, and we also down-weighted depending on the similarity of overall survival at an interim data point. At that point we have assessed the feasibility, that is, how much we can borrow from this real-world data.

After doing that, we then integrated with the RCT, and then finally calculated the treatment effect, which is our main estimate, using the down-weighted real-world cohort with the weights from the previous two steps.

So, I'll hand it over to David to talk about the evaluation steps and the learnings.

**David Paulucci:** You can go to the next slide. Great, thanks very much.

So, just to talk about how we quantitatively evaluate how well we were able to emulate the completed trial. We had some predefined success metrics for the emulated hybrid control trial, which were achieved. In particular, under the hybrid design, we saw that the overall survival hazard ratio was in the same direction, and within the confidence interval of the actual trial. Also, the primary metric was the impact on study duration. What we had seen based on the number of events that could be borrowed from the Flatiron real-world data, the emulation demonstrated that we could potentially reduce the trial by 7 to 11 months. However, it's really important to consider the results in the context of the generalizability of real-world data to the trial in this space.

Something that we had seen in particular is that, despite some methods to reduce confounding, we did see that some patients had worse survival in the real-world cohort as compared with the trial control patients. This has some implications with regard to interpreting the treatment effect in this context.

We just want to stress how important it is to have high-quality real-world data, to strictly align the trial criteria to the real-world data, and to use analytic methods, for example multiple imputation, to improve baseline characteristic alignment and increase that generalizability. You can go to the next slide.

Overall, just talking about some lessons learned, this patient-level emulation exercise using hybrid control designs via real-world drug sponsor collaboration is a really important advancement in the area of integrating real-world evidence and randomized controlled trial data. Going to the lessons, again, it's really critical to maximize data availability and completeness for key prognostic and confounding variables to successfully derive a real-world cohort closely aligned to the trial. One of the more important examples of this was ECOG performance score. Considering some of the missingness that we had observed, since this cohort was a little bit older, what we saw is that when we used multiple imputation for the ECOG data, we were able to derive a cohort for which the outcome was more similar.

It's also important to evaluate numerous statistical borrowing approaches, which is something we did, and which is probably necessary to inform the optimal method in a given study context. Also, robust application of other statistical methodologies that pertain to using real-world data, not specific to hybrid control, is important to consider, such as propensity score adjustment. Also, the timing of the interim analysis at which the borrowing decision is made needs to be carefully considered to optimize trial timeline savings.

That was it for my presentation.

**Olivier Humblet:** Wonderful. Thank you so much, David, Sanhita, and Katherine. This is such an impactful use case for real-world data. It's very exciting to hear about this innovative work. Fantastic.

So, quick reminder to everybody to submit any questions via the Q&A tool at the bottom of the screen, as we'll be starting the Q&A after the end of the next talk.

Now, unfortunately, our final speaker Jeff Leek was unable to be live with us today, but he prerecorded a wonderful presentation especially for us, which we'll share with you now before we move into the Q&A. During this talk, Jeff shares his insights on post-prediction inference when analyzing data generated using the predictions from machine learning models.

**Jeff Leek:** Thanks very much, and I wish I could be there with you live, but I'm really excited to tell you about the work we've been doing on post-prediction inference, which is a statistical problem that comes up once we have machine learned everything. I tend to talk pretty fast, and have a lot of slides, so if you go to jtleek.com and look for talks, you'll be able to find the talk slides from today's talk.

I just wanted to disclose upfront. Currently, I'm a professor of Biostatistics at Johns Hopkins, and in the summer I'll be moving to the Fred Hutchinson Cancer Research Center where I'll be Chief Data Officer and Vice President, and professor. I also do some various outreach work with Coursera, and have co-founded a couple of companies that I mention here.

I'm going to be talking about methods for correcting inference based on outcomes predicted by machine learning. This is the more formal definition of my talk title. This is a paper that we published, and you can go find the details at the link below. Anything that's good that I'm going to be talking about is probably due to these two people. Sarah Wang is a former student of mine, and Tyler McCormick is a collaborator of mine at the University of Washington. Anything that sounds silly is almost certainly due to me, and anything great is probably due to one of them.

We're going to kick off with a very important equation that comes up in every biostatistician's life. N is the letter we use to denote sample size, and a very important calculation you learn early on is that the way to calculate your sample size is to take how much money you have to spend on a clinical study and divide it by the amount it costs per sample to collect that data. That will give you a good estimate for how much data you can collect.

It turns out, though, that in almost every area, and certainly in the research area I work in most often, data is becoming much more inexpensive over time. So, this is a very common plot you see in genomics talks where you see the cost of sequencing a single human genome, which has plummeted from $100 million in the early 2000s to less than $1,000 today. According to that very simple equation I showed you just a minute ago, as the cost of collecting data goes down, the amount of data you can collect, or the sample size, increases. This plot shows the amount of data that's being collected in the Sequence Read Archive, which is a public-facing genomics sequencing database, and it's just been exploding over time with more and more data being publicly available, and freely available.

With all of this data, you'd think things like machine learning and artificial intelligence are obviously coming to the fore, but as those predictions get made, they don't always solve the whole problem. And so, you see a lot of excitement around AI. And, at first we started to wonder, as a biostatistician, or as a statistician, does that mean we just have to shut the whole thing down? Or, does that mean that we need to have a new way of thinking about the way the world works? As we started to observe that more and more people were using machine learning on these large collections of datasets, we figured out that something that often happens is, while you're collecting one type of data in a high throughput sense, that could be the genomic data that we see most often in my research group, or electronic health record data, you're actually predicting another variable that you want to use in your downstream modeling. So, you might be predicting a phenotype from billing records, or a phenotype from gene expression data. And then, you want to use that phenotype in a downstream statistical model.

But, that can cause some problems. If you look at a very simple, generalized linear regression model type framework, you can see that you might be modeling the relationship between some phenotype that you care about, and a covariate that you've collected with high throughput data. And, the typical thing that you would do is you would just measure the relationship between the observed phenotype and the observed covariate. But, if you're making a prediction, it changes the model a little bit. So, now, instead of using the phenotype that we actually observed, we're making a prediction of a clinical phenotype from genomic data, or health record data, and then we're using that in a downstream statistical model.

It turns out that this can cause problems, because, if you look at the predicted data, they tend to be less variable than the observed data. Here, I'm just showing a really simple simulated example, and I'm showing you a simulated data set in gray dots, and the blue dots represent the predictions from a machine learning model. And, you can see that they're a little bit more tightly clustered. So, the variance has been reduced. What that translates to when you use those predictions in downstream statistical models is smaller variances for all of your estimates, which means incorrect P values, incorrect test statistics, and you can see some bias in the results that you're getting. Here, I'm showing the estimated variance from regression models using predicted phenotypes, which are much smaller than what happens if you use the observed data.
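The variance shrinkage described here can be reproduced with a small simulation. This is a minimal sketch, not the example shown on the slide: the data-generating model, the stand-in "prediction" (the true signal without the noise), and the names `slope_and_se` and `simulate` are all assumptions made for illustration.

```python
import random
import statistics


def slope_and_se(x, y):
    """Ordinary least squares for y = a + b*x; returns (b, standard error of b)."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, (rss / (n - 2) / sxx) ** 0.5


def simulate(n=2000, seed=0):
    """Simulate a covariate, an observed outcome, and a prediction of that outcome."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]  # covariate of interest
    w = [rng.gauss(0, 1) for _ in range(n)]  # other features the predictor sees
    # Observed outcome: signal plus irreducible noise.
    y_obs = [xi + wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
    # Stand-in for a machine learning prediction: it recovers the signal
    # but not the noise, so it is less variable than the observed outcome.
    y_pred = [xi + wi for xi, wi in zip(x, w)]
    return x, y_obs, y_pred


x, y_obs, y_pred = simulate()
var_obs = statistics.variance(y_obs)
var_pred = statistics.variance(y_pred)
slope_obs, se_obs = slope_and_se(x, y_obs)
slope_pred, se_pred = slope_and_se(x, y_pred)

print(f"variance of outcome: observed={var_obs:.2f}, predicted={var_pred:.2f}")
print(f"standard error of slope: observed={se_obs:.4f}, predicted={se_pred:.4f}")
```

In this setup the regression on the predicted outcome reports a smaller standard error than the regression on the observed outcome, even though no extra information has been gained. That is the overconfidence described above.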

It turns out that machine learning hasn't solved all of our problems, and we have to figure out what we're going to do when the machines don't necessarily give us exactly what we need. I'm going to use a concrete example to talk a little bit about how we do this. My group is focused on developing biomarkers, for example, using gene expression data for various different cancer outcomes and other complex diseases. One of the things that we do in my group is develop large collections of gene expression datasets that have been pre-processed so that we can analyze them quickly. Typically, if you want to do a study where you're doing a biomarker study in genomics, you have to collect and process data and get it organized before you can actually do the analysis that you care about.

So, we developed a collection of about 700,000 gene expression datasets now, from samples that we've collected from groups all around the country that are publicly available, processed them on a common pipeline, and cleaned up the genomic data. But, one of the problems with this data is that, even if you process the genomic data, the metadata is often missing, and the clinical phenotypes are often missing. That's a piece that we still wanted to try to fix. Even though we had this genomic data that was processed (this is recount2, a version that had 70,000 samples, and we've released recount3 now with 700,000 samples), the missing piece is the phenotype. We have the genomic data, we're missing the phenotypes. It's a similar situation to when you have electronic health record data that includes billing information, or unstructured information in clinician notes, but you need to translate that into a clinical phenotype.

Unfortunately, in most of the publicly available data, a lot of the variables we care about are actually missing or incomplete. And, even when they are included in the datasets, they're often complicated, or they're labeled in ways that make it difficult to use this information. For example, if you're just looking at a variable like sex in the SRA, it might be labeled as a mixture, or it might have various different labels for male and female. If you want to be able to look at this information across many samples, you have to be able to standardize.
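As a small illustration of the standardization step, the sketch below maps free-text sex labels onto a fixed vocabulary. The label variants in the lookup table and the function name `standardize_sex` are hypothetical; real SRA metadata contains many more spellings, and anything not recognized is returned as `None` rather than guessed.

```python
from typing import Optional

# Hypothetical lookup table; real submissions use many more variants.
_SEX_LABELS = {
    "m": "male",
    "male": "male",
    "man": "male",
    "f": "female",
    "female": "female",
    "woman": "female",
    "mixture": "mixed",
    "mixed": "mixed",
}


def standardize_sex(raw: Optional[str]) -> Optional[str]:
    """Map a free-text label to 'male', 'female', 'mixed', or None if unknown."""
    if raw is None:
        return None
    key = raw.strip().lower()
    return _SEX_LABELS.get(key)


for label in [" M ", "Female", "mixture", "not collected"]:
    print(repr(label), "->", standardize_sex(label))
```

Returning `None` for unrecognized labels keeps missing values explicit, so they can later be predicted or excluded deliberately.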

And so, what our group did was develop a machine learning method for predicting the clinical and demographic variables for these large collections of publicly available genomic data. I'm not going to go into the details here, but the predictions work pretty well. Now we have a collection of phenotypes that we can use with the gene expression data. This is work by Shannon Ellis, who's a former post-doc in the group, and now a faculty member at University of California, San Diego.

Now we have both genomic data and clinical metadata, and we want to start answering questions with the recount resource. And, the challenge is that our metadata and our clinical phenotypes are all predicted; they're not actually the observed values of those variables. And so, we're doing a weird thing where we're predicting the phenotypes from the gene expression data, and then we're taking those predictions and using them in models related to that same gene expression data. It's a little bit of a circular pattern where you predict from the gene expression data, then use that predicted variable in a regression model. But, it happens everywhere. It happens a lot in genetics. This is an example from transcriptome-wide association studies, but it happens all over the place. It happens in single cell RNA sequencing data, where you're trying to predict cellular trajectories, or cell types. And then, you want to use those in downstream statistical models. It happens in polygenic risk score modeling, where you want to make predictions about phenotypes of patients, and then use those predictions in downstream clinical risk models. And, it happens a lot when you're looking at real-world evidence and electronic health record data, where you want to use billing records and electronic health record data to predict clinical outcomes or clinical phenotypes.

So, the challenge is we're using these predicted variables in downstream statistical models. And, that can cause all sorts of problems, and so we need what we call post-prediction inference. And, the real idea is how do we go from a regression model, where we're using the observed outcome, to a regression model, where we're using a predicted outcome? And, the real challenge is the predictions can be complicated. It's not like we're using really simple prediction models most of the time. It might be something like a random forest, and more recently it's going to be largely some kind of deep neural network that will include a very complicated mathematical formulation, and a computational formulation to make those predictions. And, it's really hard to model the statistical characteristics of these very complicated prediction models. So, how do we get around the fact that it's hard to model something as complicated as a neural network in a statistical model?

The real challenge is that these models often don't even get published. You often don't know what the hyperparameters are, don't know what the layers are in the network. And so, you have to actually build a method of inference that's agnostic to how the prediction actually is generated. One key observation, and I want to credit here Sarah Wang, my former student, for making this observation, is that if you plot the observed outcomes versus a covariate in a regression model, and you plot the predicted outcomes versus the covariate, of course they look relatively similar in those two models, especially if your prediction is good. But, the real key observation is that there's often a very simple relationship between the predicted outcomes and the observed outcomes, regardless of how complicated your machine learning method is. For example, here, this is predicting outcomes with a K-nearest Neighbors machine learning approach, which is a fairly standard approach, and you see that there's this relatively simple linear relationship between the observed outcomes and predicted outcomes. It turns out if you use a neural network, a Random Forest, K-nearest Neighbors, SVM, you always observe this really simple relationship between the observed and predicted outcomes. So we can take advantage of this simple relationship, and model the relationship between the observed and predicted values to correct our statistical inference. There's a lot of mathematical detail behind this, and so I'm not going to really go into that here today. We don't have time. You can read the paper. Of course, I'd be very excited if you did. But the basic idea here is we used this relationship, this known simple relationship between the predicted outcome and the observed outcome, to update our statistical model and correct estimates of variance and correct bias in these models.
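To make the idea concrete, here is a minimal sketch of a bootstrap-style correction built on that relationship. It is loosely inspired by the description above and is not the published algorithm: the linear relationship model, the simulated data, and the names `ols`, `corrected_inference`, and `make_data` are all assumptions for illustration.

```python
import random
import statistics


def ols(x, y):
    """Simple OLS for y = a + b*x; returns (a, b, standard error of b, residual sd)."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    resid_sd = (rss / (n - 2)) ** 0.5
    return intercept, slope, resid_sd / sxx ** 0.5, resid_sd


def corrected_inference(y_test, pred_test, x_val, pred_val, n_boot=200, seed=0):
    """Return (slope, standard error) for outcome ~ x using only predictions in x_val."""
    rng = random.Random(seed)
    # Step 1: on labeled data, model observed outcome as a function of the prediction.
    g0, g1, _, sd = ols(pred_test, y_test)
    n = len(x_val)
    slopes, ses = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        xb = [x_val[i] for i in idx]
        # Step 2: simulate plausible observed outcomes from the predictions.
        yb = [g0 + g1 * pred_val[i] + rng.gauss(0, sd) for i in idx]
        # Step 3: fit the downstream regression on the simulated outcomes.
        _, b, se_b, _ = ols(xb, yb)
        slopes.append(b)
        ses.append(se_b)
    return statistics.median(slopes), statistics.median(ses)


def make_data(n, rng):
    """Toy data: outcome depends on x and w; the prediction misses the noise."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    w = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
    pred = [xi + wi for xi, wi in zip(x, w)]
    return x, y, pred


rng = random.Random(1)
_, y_test, pred_test = make_data(500, rng)  # labeled set: outcomes and predictions
x_val, _, pred_val = make_data(500, rng)    # unlabeled set: predictions only

_, naive_slope, naive_se, _ = ols(x_val, pred_val)
corr_slope, corr_se = corrected_inference(y_test, pred_test, x_val, pred_val)
print(f"naive:     slope={naive_slope:.3f}, se={naive_se:.4f}")
print(f"corrected: slope={corr_slope:.3f}, se={corr_se:.4f}")
```

The correction never looks inside the prediction model; it only needs a labeled set where both predictions and observed outcomes are available. That is what makes this kind of approach agnostic to how the predictions were generated.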

So we did a lot of work in the original paper on this looking at simulations, where we plot the standard error that we would get with the true observed outcome on the X-axis and the standard error we would get from that same regression model using a predicted outcome. If you don't do any kind of correction for the fact that you used the prediction, you get a deep underestimate of the standard error, and if you correct it, then you get about the same thing you would've got with the observed values, even though you used the predicted values instead. It turns out you can also fix the test statistics with this similar sort of correction.

So this is really exciting because we could go back to this recount study and we could do things like look at one of the variables that we really cared about, which was RNA quality, and that matters a lot for biomarkers. It's missing for many of the data sets, and we could predict the RNA quality and include that in regression models. With our correction, the blue and green dots here show that we can actually recover the statistical inference that we would've got if we had observed those RNA quality metrics.

So the cool thing that we've been doing over the last year or so is actually working with a team at Flatiron. Arjun and Alex have been working with us and deserve a huge amount of credit. Again, anything smart is due to them. Anything silly I say is due to me. We've been really looking at how you use post-prediction inference, like we discussed from our earlier model, but for real-world evidence. So we've been designing these studies where we look at, if we structure the prediction model in different ways and we make these predictions, and then we look at various different ways of correcting our inference can we improve performance of downstream regression models?

It turns out that you can actually do quite a bit better. Here the threshold method and bootstrap method are sort of using the predictions without any correction whatsoever, and I'm showing the bias in the hazard regression coefficient for a series of models for PD-L1. You can see that with some calibration or imputation-based approaches, you can improve on the bias that you would get if you just use the predictions without paying any attention to the fact that they are predictions. So this is really work in progress. It's really early stages, but it's exciting, and it shows that there's really a way that we can improve and correct inference, even with machine-learned approaches. So we really can work with machines. This is a robot puppet that I made with a colleague of mine, James Taylor here at Johns Hopkins, but it shows that we can kind of work together with the machines and use machine learning approaches and still get correct statistical inference. Thank you.

**Olivier Humblet:** Fantastic. Thank you all for these amazing presentations. They are really insightful, and helpful in illustrating the real-world impact of these methodological innovations. Before we dive into the Q&A, we want to take a moment here to discuss how everything today connects. An overall theme that emerges from all of these is that utilizing integrated evidence requires thoughtfulness and methodologic rigor. When pooling data, meaning that we're deciding whether or not to concatenate data for different groups of patients, we've seen that we need to be really mindful of the dataset-specific nuances. They can determine whether or not it's appropriate to do this. That includes the really statistically challenging aspects of heterogeneity assessment that Daniel discussed.

Next, hybrid controls are a more sophisticated way of combining data from different groups of patients, in this case trial plus RWD, where the RWD can be integrated into the trial control arm using statistical borrowing methods to increase the efficiency of running clinical trials. This is really exciting. Then Jeff's work brings in another dimension of integrated evidence, which is that instead of bringing data for more patients in, as we were doing before, now our goal is to bring in another data element for the same group of patients, in this case data generated using machine learning, and he showed how thoughtful and rigorous we need to be about what methods we use to integrate new variables into our analysis.

In summary, we're very excited to see that RWD methods are reaching the level of sophistication and rigor we've seen here today. It's really an exciting time to be in this field. So with that, let's begin our Q&A discussion. We're seeing some great questions come in, and also if there are any specific questions for Jeff, please ask. We'll share your questions and have him address them offline. Okay. Let's kick things off with a question for Daniel. Daniel, from your experience, can you give an example of types of data for which it would be obviously inappropriate to pool?

**Daniel Backenroth:** Thanks, Olivier. I think if you can't harmonize key data elements, if you think that an outcome is measured in a fundamentally different way in two data sets, then you just wouldn't want to pool. Finally, I mean, if the populations are so different. So I think there, you need clinical input to make that determination, but that's another case where you wouldn't want to pool.

**Olivier Humblet:** Absolutely. That's great. The next question is for Katherine. Katherine, one of the concerns in combining RCT with RWD data, for example, external control arms, is that the patients could be quite different. How similar should the RWD patients be to the RCT in order to use hybrid control arms and what if they're not similar?

**Katherine Tan:** Thanks, Olivier. This is such a good question. The question about difference is very much true when you combine any data source, whether it's a real-world data source or a historical clinical control. But using a hybrid control arm, there's always going to be bias due to a variety of reasons. The treatment landscape may have changed. The treatment patterns may be different. So, unmeasured confounding. So the question is, how similar do the external data sources need to be? I would say, one of the advantages of hybrid control arms compared to, say, a fully external control arm is allowing some randomization in the hybrid control arm. So that would allow you to assess some of the differences. I also wanted to call out that depending on the hybrid control method you use, some methods actually do implicitly account for similarity in borrowing, for example, test-and-pool, which tests both the external and the RCT control for similarity. And even for many of the dynamic borrowing methods, the external data source will be down-weighted depending on how similar the data sources are.
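As a rough illustration of the test-and-pool idea mentioned here, the sketch below compares the randomized control arm with the external control on a single continuous outcome and pools only when no significant difference is detected. This is a toy version under stated assumptions: the two-sample z-test, the 0.05 threshold, the made-up numbers, and the name `pool_if_similar` are all illustrative, and a real hybrid control analysis would typically involve time-to-event outcomes and covariate adjustment.

```python
import statistics
from statistics import NormalDist


def pool_if_similar(rct_control, external_control, alpha=0.05):
    """Return (control outcomes to analyze, whether external data was pooled)."""
    mean_diff = statistics.fmean(rct_control) - statistics.fmean(external_control)
    # Standard error of the difference in means, allowing unequal variances.
    se = (
        statistics.variance(rct_control) / len(rct_control)
        + statistics.variance(external_control) / len(external_control)
    ) ** 0.5
    z = mean_diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    if p_value > alpha:
        # No evidence of a difference: borrow the external controls fully.
        return list(rct_control) + list(external_control), True
    # Evidence of a difference: fall back to the randomized controls only.
    return list(rct_control), False


rct = [1.0, 2.0, 3.0, 4.0, 5.0]
similar_external = [1.5, 2.5, 3.5, 4.5, 2.0]
different_external = [11.0, 12.0, 13.0, 14.0, 15.0]

print(pool_if_similar(rct, similar_external))
print(pool_if_similar(rct, different_external))
```

Test-and-pool is all-or-nothing, whereas the dynamic borrowing methods mentioned above down-weight the external data by degree of similarity rather than making a single pool or no-pool decision.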

**Olivier Humblet:** Fantastic. Thank you, Katherine. The next question is for David. David, looking at hybrids from the drug sponsor's point of view, what were you looking to achieve with this proof-of-concept emulation study, and do you think it was successful?

**David Paulucci:** Thanks, Olivier. So our main objective is to be able to do this prospectively. What we wanted to have first and foremost is a suite of tools and capabilities internal at BMS to be able to do this prospectively, which is why we tested a variety of methods and really looked to learn in our collaboration with Flatiron for when we do do this prospectively. In particular, in this example, one of the main metrics that we had looked at was reduction in study timelines. One of the main benefits that you could look to obtain when doing dynamic borrowing and hybrid controls is to reduce how long the study would've taken under a standard randomized control design. In this particular setting, we did see that we would potentially have reduced the study timeline. So that was an achieved success metric. So to that end, we think this is a pretty successful collaboration and think our metrics were achieved from a sponsor point of view.

**Olivier Humblet:** Great. Thanks a lot, David. The next question is for Sanhita. You mentioned some of the data completeness challenges that are inherent to real-world data. Can you speak a little about what methods were used in this study to address this and what the effect was?

**Sanhita Sengupta:** We had a couple of baseline covariates here and it was actually a huge challenge and we dealt with missingness in each variable separately, depending on the extent of missingness, pattern of missingness and if it's a prognostic variable or not. ECOG, as David pointed out, was a big prognostic variable. So we ended up doing multiple imputation. We also checked a complete case analysis where we removed patients with missing data. Lastly, we also tried a sensitivity analysis. Various methods were tried. As I said, it depends on the extent of missingness. Here in this case, ECOG was mainly missing. There might be other variables in other contexts.

**Olivier Humblet:** Great. Thank you, Sanhita. The next question is for Katherine. Katherine, what sort of endpoints can be used in hybrid control arm designs currently? Can we use PFS and response too?

**Katherine Tan:** Yeah, that's a good question. I think in the case study with BMS, we used the overall survival endpoint, and just definition-wise, mortality is a pretty set definition between RCT and RWD, caveating that with the distribution of death, as Daniel mentioned in his presentation as well. We need to make sure that there's no missing data and whatnot. I think the question was, can we use PFS and response too? I would say, yes, you should be able to use that. I think one thing to maybe bear in mind is to think through if there are any differences between the definitions in the external data source and the RCT. For example, if you're using a real-world data source, there may be differences in scan timing between real-world data and the RCT, and you might want to account for that.

**Olivier Humblet:** Fantastic. That's great. Another question for Daniel. So Daniel, you had mentioned two methods to evaluate heterogeneity. First, an aggregate method, and second, an individual patient data method. Can you speak to the trade-off, or advantages and disadvantages to both methods and when one might be preferable over the other?

**Daniel Backenroth:** Thanks. That's a good question. So in the talk, I focused on the fact that the IPD method may not be possible in all cases. So obviously if the aggregate method is the only possible method, you should go with that one. If they're both available, I think it's probably six of one, half a dozen of the other. I think, as we showed in the simulations, they get pretty similar results. The IPD method may have some advantages in that you're not looking at the trial data alone. So you can potentially maintain some blinding while doing the analysis, which you might want if this is a preliminary analysis. Again, we didn't recommend doing a conditional type of analysis where you condition what you do next on the results of the test. But nevertheless, you could theoretically maintain more of the blinding using the IPD method than the aggregate method.

**Olivier Humblet:** Absolutely. Great. Thanks a lot, Daniel. Thank you to everybody. On that note, that's all we have time for. So that wraps up episode four. Thanks very much once again to all of our speakers today for sharing your insights, and thanks to all of you for joining us. As a reminder, we've got two more episodes over the coming weeks. Next up, on April 27th, we'll hear case studies from life science partners, sharing how they've used real-world evidence to support decision making. Finally, since we weren't able to get to all of your questions, please know the lines of communication remain open, even after we end this episode. Feel free to get in touch with us at rwe@flatiron.com. Finally, a friendly reminder to please take the survey upon closing out. It'll help us improve future webinars. Thank you and see you all next time. Stay healthy and stay safe.