Development and Validation of Real-World Endpoints
Several speakers, including Sean Khozin of the FDA’s Oncology Center of Excellence, discussed the challenges as well as the progress in generating reliable endpoints from real-world datasets at the 2018 Flatiron Research Summit.
Of note, Dr. Khozin proposed a framework for contextualizing the many endpoints available to researchers working with real-world data. Rather than holding each endpoint to the same standard, he proposed thinking about them in three categories: “validated”, “reasonably likely”, and “candidate”. While many real-world endpoints fall into the “candidate” category, he acknowledged that endpoints in the middle category – “reasonably likely” (e.g. response rate) – may be suitable for use in accelerated approvals.
Aracelis Torres: My name is Aracelis Torres. I'm a Senior Quantitative Scientist at Flatiron. And as was noted at this morning's keynote, we are trying to demystify some of that process between the source evidence and the creation of our underlying datasets.
So first, for the primer, what is it that we mean when we start talking about real-world endpoints in oncology? The first thing to note is that there is no single endpoint that we leverage in analysis and in decision making. This is just a spectrum of the types of endpoints that we are discussing. It is not just mortality, it is also assessing and evaluating tumor burden, and even understanding how the patient is feeling, or how they report they are feeling. And as we start thinking about real-world endpoints, we need to discuss them against the backdrop of how they compare relative to clinical trial endpoints. What is it that makes them so different? What are some key and important considerations that we should keep in the back of our minds when analyzing them and also when leveraging them for specific use cases?
One key thing to keep in mind when comparing real-world endpoints with trial endpoints is that, in the real world, patients come in to see the physician when they don't feel well. In trials, by contrast, there's a very specific cadence and expectation that we will see them every few weeks, as defined by the protocol. In the real world, we'll also expect that certain patient data elements won't be collected unless the physician really needs them to inform how they're caring for that particular patient, whereas in a trial, again, it's all defined by the protocol, so you will likely see increased completeness of things like ECOG or lab capture to ensure that it is all available for analysis of trial data. In the real world, there's also variability by clinician interpretation: a physician is looking at the totality of the patient's care and how they're feeling when they step into the office, while trial data is, more often than not, evaluated by a centralized investigator applying the same criteria across all assessments. And in the specific example of progression, in the real world it is often determined by a variety of factors by the treating physician, while trial data predominantly leverages RECIST criteria.
I'll spend another minute or two specifically on RECIST and why it's currently so challenging to assess in the real world. One portion of it is a technical infrastructure challenge: images tend to live in a very specific location, separate from the electronic health record, and thinking through how we can meld all of that information together is certainly an uphill battle, though not an insurmountable one. But in addition to the technical challenge, there are also completeness issues. More often than not, the imaging reports themselves will not include all the specifics on RECIST needed to calculate it. And finally, there is a longitudinal aspect: radiologists may not be able to access prior images to understand whether the disease has changed in any way with respect to its size relative to the last time the patient was scanned. So as of today, data for RECIST is not routinely captured in real-world care. And certainly, in the meantime, while we figure out how to incorporate more scanned images into our assessments of real-world endpoints, the patients are waiting on decision making, and we need to understand what key components we can leverage to help inform the effectiveness of therapies in the real world.
So why the need for real-world endpoints? It is traditionally known that clinical trials themselves tend to be fairly slow and pretty expensive, but more importantly, they offer very limited external validity. The patients that make it into trials tend not to be generalizable to all patients afflicted with cancer, not only throughout the United States but across the globe. It is through real-world datasets that we really understand, post-approval, how a therapy will perform in a patient with various comorbid conditions, with compromised liver and renal function: the type of patient that would not have made it into a clinical trial. And I should note that there's not just a need for real-world endpoints, there is a need specifically for reliable real-world endpoints. It is this need for reliable real-world endpoints that has often been the reason why observational and retrospective studies tend to be seen with some skepticism. How can I be sure that what I am measuring is truly the outcome of interest? How can I get a better grasp of whether I can make decisions with it? What will be critical in the utilization of real-world endpoints is that we characterize the quality of those outcomes and understand the specific use cases in which they can be leveraged.
So in this need for and development of real-world endpoints, we need to start evaluating the current spectrum of endpoints that are available, but also look ahead towards the development of new ones. How can we augment current mortality information with external datasets? How can we get a better understanding of treatment-based endpoints? Are there certain treatment settings or disease settings where they are informative and can give us a sense of the effectiveness of therapies? How do we continue the path forward to understanding the real-world version of progression in the absence of RECIST, while in parallel still trying to solve the retrospective and prospective issue of images? And then finally, what does the future hold for patient-reported outcomes? How can we continue to think about incorporating them at scale and also analyzing them? I should note that all of this should not happen in a vacuum.
We also need to provide analytic guidance along with information about what went into the development of those endpoints. Otherwise, analyses may erroneously include other data points in tandem with those assessments, and without information about how an endpoint was developed, we run the risk of estimates that yield invalid inferences. And we're all here today because, at the end, we care about the patient and we want to make sure that those estimates are correct to inform correct decision making. So keeping all of this in the back of your minds, I'll hand it over to Sean, who will provide the regulatory perspective on real-world endpoints. Thank you.
Sean Khozin: Thank you, great to be back. And I'd like to talk about just an organizing framework for evaluating real-world endpoints. The topic is a very complex one, like all the other topics that have been discussed today. And a very nuanced one as well. And I believe it helps a lot to have a foundation based on prior experience, but also emerging regulatory concepts that can help us move forward, in terms of thinking about real-world endpoints and how to increase our level of confidence in these endpoints, in terms of their validity and precision.
So, I'm gonna step back and start from the basics, in terms of clinical trial endpoints. What are they? We all intuitively know this and there is a regulatory definition for a clinical trial endpoint. Which says that, endpoints are essentially measurements that capture outcomes of interest in a clinical trial. And these measurements can be laboratory measurements, tumor measurements, and a variety of different measurements that can be captured in the clinical setting. And when it comes to the different types of measurements or clinical endpoints that we typically use in clinical trials for regulatory decision making, generally speaking, there are two different types. There are direct measurements and indirect measurements.
Direct measurements are essentially clinical outcomes and these are measurements that directly measure an outcome that's clinically meaningful in terms of how the patient is feeling and functioning or living, in terms of survival. So, overall survival is a direct measurement. And when it comes to evaluating these endpoints in regulatory decision making, typically these endpoints are used for traditional approval decisions. And traditional approval is a new term for regular approval.
And there are also indirect measurements, and we call those surrogates. These are very important endpoints that have been used increasingly in oncology clinical trials over the past couple of decades, and what's interesting is that, based on the latest comprehensive data we have available, between 2010 and 2012 nearly half of FDA approvals were based on indirect surrogate endpoints. So these endpoints are very important in clinical development and are used very widely. And when it comes to surrogate endpoints, there are a variety of different surrogates, and each one is positioned along its continuum of validation, from a candidate surrogate endpoint all the way to a validated endpoint. So the most concrete and, let's say, most trustworthy surrogate endpoint is one that's validated. And what that means in the regulatory sense is that it's an endpoint supported by a clear mechanistic rationale and clinical data providing very strong evidence that an effect on the surrogate endpoint can directly predict a specific and clinically meaningful benefit. That is what we call a validated surrogate endpoint.
There's a second category, which is called reasonably likely and these surrogate endpoints are typically used for making accelerated approval decisions. Overall response rate belongs to this category. It's a surrogate endpoint that's reasonably likely to predict clinical benefit. And looking at it this way, which is the technical definition, overall response rate isn't really a validated endpoint, it's a reasonably likely endpoint that the community feels very comfortable with, in terms of predicting clinically meaningful outcome. And the third category is a candidate surrogate endpoint. Some of the real-world endpoints that we are examining today belong to that category. Those are candidate endpoints that show a lot of promise in terms of giving us information about a meaningful clinical outcome.
And the 21st Century Cures Act, which has come up several times today, asked the FDA, actually mandated the FDA, to come up with a list of all the surrogate endpoints that the agency has used in making drug and biologic approval decisions. And here is just a snapshot of some of the oncology-related surrogate endpoints, and there is a link underneath to the full list, which is actually downloadable. And as you can see, there are a number of surrogate endpoints that have been used, and I already mentioned that overall response rate is typically used for making accelerated approval decisions. It's a very robust, if you will, surrogate endpoint that, based on community consensus and the experience that we've had at the FDA, is reasonably likely to predict clinical benefit. So it's been used for making accelerated approval decisions. There are obviously serum biomarkers, and there are different ways of validating these biomarkers. And with these surrogate endpoints, if they are being used as the primary endpoint, then the performance characteristics of the vehicle or the test being used to generate them become very important.
And there's a biomarker qualification program at the FDA, and that's one route to qualify and validate these biomarkers. But typically biomarkers don't go through a qualification program; they're incorporated through clinical development programs, and the analytical validation and the clinical validation are done as part of the development program. And there are cytogenetic composite endpoints that are typically used, in this case in CML, and there are time-to-event endpoints, such as event-free survival, that have been used in the past for both accelerated and traditional approval decisions. So there are a number of surrogate endpoints, a variety of different ways of validating these endpoints, and different validation types.
So, there's obviously analytical validation. And that speaks to the technical performance of the endpoint, or the mechanism through which the endpoint is generated. So if the data is being captured from electronic health records, and that data is also combined with other data sources, the way that those dots are connected, the technical performance of that measurement, is the analytical validity. And this concept is very close to the same principles that we use for companion diagnostic assays, where the performance characteristics of the assay itself can be analytically and clinically validated. Analytical validation speaks to the technical performance, and in this case the technical performance speaks to how the data was gathered, the audit trails, and the reproducibility, for example, if the same technical measures are deployed on similar datasets. And analytical validity, of course, doesn't tell you anything about whether that endpoint itself is clinically useful or meaningful and has clinical validation.
All the real-world endpoints that we're talking about today have correlates in traditional clinical trials, that have already been clinically validated. For example, progression free survival. That's an endpoint that we know is useful and is clinically meaningful. So, that makes the clinical validation piece somewhat easier, however the technical performance still is very critical to make sure that real-world progression free survival closely approximates progression free survival in traditional clinical trials. And that's a function of the technical performance of that measurement.
So, one way that we can characterize real-world endpoints, and I believe this is very critical as we move forward, is that we need a mechanism, a framework, that can harmonize the nomenclature and can basically move everyone in the same direction, making sure that we're speaking about the same things. And one of the best ways to categorize real-world endpoints is to look at them as drug development tools. "Drug development tool" is a new phrase introduced by the 21st Century Cures Act, and in fact Congress directed the FDA to develop a mechanism for designing, validating, and qualifying drug development tools. The way a drug development tool is defined in the 21st Century Cures Act is as follows: a biomarker, a clinical assessment, or any other method, material, or measure that the Secretary determines can aid drug development. So essentially, real-world endpoints would fall under this category, and this is also the category that we're using to validate algorithms, AI algorithms. Many of these algorithms are being incorporated as software as a medical device solutions, as many of you know, and the way that we are approaching the validation, the clinical and the technical validation, of these algorithms is through the drug development tool pathway. And real-world endpoints can fall under the same pathway.
And I think that opens up a new area of exploration and discussion and very interesting opportunities in terms of how to systematically approach defining and validating real-world endpoints. In many cases these endpoints can be thought of as drug development tools for clinical outcome assessment. Many of us are familiar with clinical outcome assessment tools, and PROs are part of that, and when it comes to looking at these clinical assessment tools, the same validation principles apply. The language is a little different but a lot of the concepts are similar. For clinical assessment, we have construct validity, which is based on quantitative methods to make sure that the endpoints, and the methodology used quantitatively to describe them, align with a pre-specified hypothesis.
And there's also content validation for clinical assessment tools, which speaks to the use of qualitative research methods to make sure that the concept of interest is captured, including evidence that the domains being used to define these endpoints are appropriate and comprehensive relative to the intended measurement concept, the population, and the intended use; this is sometimes called face validity. And I believe this is the nomenclature that we can use to anchor the discussions that we have about real-world data endpoints. And with that, I'd like to welcome Dr. Abernethy back to the stage.
Amy Abernethy: I always want to find out what music I have. And I don't know about you, but I've been dying to have one of these cute little glasses of water. So, we're gonna move on to validation. So those of you who I did not meet earlier, my name is Amy Abernethy, I'm the Chief Medical and Chief Scientific Officer here at Flatiron and we're gonna talk about how we've been thinking about validating real-world endpoints.
Importantly, as I get into this, I want to hit on the fact that this is a straw person. We've really tried to come at this by putting something down on paper and sharing it around so that we can all reflect on it, refine it, and move the whole space forward. So, share it around, refine it, help it get better, please. We're gonna talk today about two exemplar endpoints, mortality and tumor burden, specifically, tumor progression. And as we think at Flatiron about endpoint validation, we use the same overarching framework that we use in any of our R&D activities inside of Flatiron.
First of all, we think about what's the source data we're gonna go to in order to generate the endpoint of interest. The second thing we think about is the processes, for example, how we're gonna abstract from the electronic health record when we need a detailed abstraction from unstructured documents, and what the processes and procedures are, how they're documented, how they're controlled, how they're versioned. We think about analytical approaches to do this in a systematic way, and again, how that is documented and shared and made available to you. For example, if you work with Flatiron data it often comes to you in terms of analytic guidance, and then we think about it in terms of validation approaches. How do we systematically evaluate the data points themselves, understand reliability and validity, and get that information out to you so that you have it at your fingertips in the context of doing your research.
When we think about developing endpoints inside of Flatiron, we also apply another core Flatiron key principle, which is iteration. Develop something, pressure test it, see what the results are after that pressure test, and then figure out if we need to go back to the drawing board or make some simple improvements, until we get to the place where we think something's ready to go. And we do the same thing with endpoints. So we define an endpoint, we think about how we're gonna go to the source data and how we're gonna curate it, we then develop a validation approach, which I think of like an assay to assess the endpoint, we apply that assay and look at the results, we go back and see whether or not we need to refine the curation, and then we continue that loop until the endpoint is ready to be put into the dataset. So what's the assay? Inside of Flatiron, we call that assay the validation process for endpoints. And what's interesting, when I look at our validation process, and many of you can even remember the days on the whiteboards where we wrote many of these sections down, is that we developed it using the backbone of the patient-reported outcomes validation process.
So the nice story here is, it lines up with what Sean said just a couple minutes ago. The first part of the validation process is face validity, which aligns with the language of content validity that Sean just mentioned: does this particular endpoint make sense to end users, regulators, clinicians, researchers, and the public at large as we think about what we're doing? The second is internal validity and reliability. In particular, within the context of curated data, we think about data completeness, and, if the data has been abstracted, whether two abstracters get reasonably the same answer when they look at a chart and abstract a piece of information. So inter-rater reliability, or inter-rater agreement, and accuracy. And then external validity, that construct validity that Sean mentioned: a series of experiments that asks, does this endpoint perform as we would expect it to perform for the task at hand? Those are the three elements of our overarching framework, and then for different endpoints, within each of these three sections, we refine it to match the endpoint of interest.
So let me take you through how that works with two case studies. The first is mortality. Now, you might be saying to yourself, come on. It's binary, it's one of the things we could most easily know: a person is alive or dead. We don't like to think about it that way, but it's indeed the truth. But actually, think about it in the clinic. When a person passes away, I often don't have a reason to go back and open the medical record and put that information into the chart. So many times, information about mortality in real-world clinical settings is missing from the electronic health record. Meanwhile, it is a very critical data point for almost all of our analyses.
There are other places we might get mortality data, for example, the Social Security Death Index. And we think to ourselves, well of course, the SSDI, that's where we go. But look at this: what you can see is that between 2011 and 2016, the number of Social Security Death Index reported deaths was going down. Does that mean that fewer people in the United States were dying and that mortality was declining at that rate? No, it means that the laws changed, and fundamentally states don't need to report now in the same way that they did in 2011. And what we see is that states are reporting differently. So, depending on what state or locale you come from, the likelihood that the SSDI is gonna be a reliable source of death data changes. It's also lagging behind, so it's not as recent as you might need for the kind of real-world analysis that we're talking about. There is a fantastic national dataset that's very complete, the National Death Index, but it's only available for use in certain circumstances, it's only available once a year, and it lags by about 12 to 18 months. So, interestingly, we can think about the National Death Index as a benchmark, complete and accurate. But we can't think about it as a real-world data source to use in all of our datasets.
So the way that we approach this at Flatiron is, first, we think about how we develop a mortality endpoint, and then how we build our assay to evaluate that mortality endpoint. In terms of developing the mortality endpoint, just like I said before, we do this in an iterative way: we developed version one, we're now on version two, and we're working on version three. In the beginning, what we did was take the structured death data that was available in the electronic health record, for the times when the oncologist typed it in, as well as link that to a commercially available dataset, and we put the two together. We then took the National Death Index data, and we did this together with colleagues at Roche / Genentech, and used the National Death Index as a gold standard data source to benchmark the mortality data. And what did we find? It wasn't good enough. The sensitivity and the specificity, I'll show you in just a second, were not high enough to have confidence in using this in the kind of analysis we're talking about.
So we went back to the drawing board and we said, okay, we need to take what we've already done in version one and make it better. We enhanced it in two ways: it now includes the Social Security Death Index, and we also now go in and abstract death data from the chart for times when it's available in an unstructured document but not necessarily in a structured field. And that now becomes Death Data 2.0. So many of you have death in your datasets, and that's the dataset we're talking about right now, and again, we verify it against the National Death Index as the gold standard to be able to document sensitivity and specificity, and we version it and publish it so it's available for you. This was published in Health Services Research earlier this year.
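To make that composition concrete, here is a minimal sketch, not Flatiron's actual pipeline, of how a composite death date might be assembled from the sources described above; the field names and the earliest-date rule are illustrative assumptions only.

```python
from datetime import date
from typing import Optional


def composite_death_date(
    ehr_structured: Optional[date],   # death date typed into the EHR, if any
    ehr_abstracted: Optional[date],   # death date abstracted from unstructured notes
    ssdi: Optional[date],             # Social Security Death Index record
    commercial: Optional[date],       # commercially available death dataset
) -> Optional[date]:
    """Combine several death-data sources into one composite death date.

    Illustrative rule only: take the earliest date reported by any source.
    A production algorithm would also resolve conflicting dates, handle
    month-level granularity, and track which source each date came from.
    """
    candidates = [d for d in (ehr_structured, ehr_abstracted, ssdi, commercial) if d is not None]
    return min(candidates) if candidates else None


# Example: structured EHR field is empty, but an abstracted note and SSDI both report a death.
print(composite_death_date(None, date(2018, 3, 14), date(2018, 3, 20), None))  # 2018-03-14
```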
Where is it heading? We need to continue to enhance it and get it better. So we continue to look for new data sources, we are thinking about how we can improve the electronic health record and clinical workflows to enhance data collection directly in the clinic, and all of this will continue to be pressure tested each year as new versions of the National Death Index become available. What does it look like? Well, this is non-small cell lung cancer. And I'm standing right in front of it so I'm going to go over here. If you look at non-small cell lung cancer, what you see in blue is the Social Security Death Index, what you see in purple is version one, and what you see in red is version two.
If we look at just the sensitivity of the mortality data from the Social Security Death Index alone, that's remarkable: it's 30%, which means 70% of the deaths are missing if you just use SSDI in your real-world datasets. If we take our first version of the amalgamated data, structured data in the EHR plus a commercially available dataset, sensitivity goes up to 79%. If we now take the commercially available data, structured EHR, abstracted EHR, and the Social Security Death Index and put them all together in a new amalgamated dataset, the sensitivity is up to 90%. Specificity is already pretty high and remains high, 'cause if a person is reported as dead they most likely did indeed pass away. And accuracy is again high. And we can think about this by disease. So now, if we take a look at metastatic colorectal cancer, breast cancer, and melanoma, we can see the differences in terms of the ranges in sensitivity, specificity, and accuracy. And it's all in pretty much the same range, so that most of the datasets that we've got have sensitivities running right around 90% for the mortality data.
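As a rough illustration of how numbers like these are produced, here is a small sketch, under assumed data structures, of benchmarking a composite mortality flag against a gold-standard source such as the National Death Index.

```python
def benchmark_against_gold_standard(composite_dead: list[bool],
                                    gold_dead: list[bool]) -> dict[str, float]:
    """Sensitivity, specificity, and accuracy of a composite mortality flag
    versus a gold-standard source (e.g. the National Death Index).

    composite_dead[i] / gold_dead[i]: whether patient i is recorded as
    deceased in the composite dataset / in the gold standard.
    """
    tp = sum(c and g for c, g in zip(composite_dead, gold_dead))
    tn = sum(not c and not g for c, g in zip(composite_dead, gold_dead))
    fp = sum(c and not g for c, g in zip(composite_dead, gold_dead))
    fn = sum(not c and g for c, g in zip(composite_dead, gold_dead))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),  # share of true deaths captured
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),  # share of survivors correctly not flagged
        "accuracy": (tp + tn) / len(gold_dead),
    }
```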
But now you know what you're working with. Not only do you know what you're working with, you can actually think about what that means. As I look at, for example, a point estimate for overall survival coming out of my results, how might that point estimate be inflated because of missing mortality data, and how do I need to think about that and plan for it in my analysis? And also other questions, such as: some sites have really, really high reporting in the electronic health record, so can I do sensitivity analyses that focus on high-reporting sites or high-reporting states, so that I can understand the distribution of the results? And then, as I talked about before, there's that process of iteration, continuously updating and going back. So we now have our mortality dataset that's continuing to grow, an annual update of the National Death Index that we use to then annually update our understanding of sensitivity and specificity, and updated quality reporting so that when you see a report, it goes with the version of the dataset and you understand what you're working with, and then that goes into the analytic guidance that we provide to you about how to work with the data. We're not sure that this is the right way to do it. This is a best guess for right now.
So again, as a community of researchers and thinkers working together, we ask you to help this process get better. So what about progression? As Sean mentioned, progression is something that means a lot within the context of our datasets. As I think about tumor progression and, for example, PFS, it helps me understand how to have a conversation with a patient in the clinic as well as how to understand what might happen in this person's life with this disease. However, if I think about being able to curate progression out of the electronic health record, we've got to come up with some kind of systematic framework to look at the electronic health record, pull out data points in a systematic way, turn that into an amalgamated construct that we call progression, and then put that into a dataset so progression-free survival can be calculated.
And so, as we think about that, not only do we have to think about how the data are going to be captured, we also want to make sure we have a validation framework to go along with it, one that allows us to assess how good a job we are doing at abstracting that information and what we need to do to continue to update it across time. It's the same conceptual framework as you saw for mortality. The complexities of real-world data, I think, are becoming really obvious to you as you do the abstraction activity outside and other things, but ultimately you've got to deal with the ambiguity in radiology reports and interpretation, and the fact that patients may be assessed at variable time points across their clinical care. There are clinical nuances, such as pseudoprogression and mixed response, and we've got to think about how we deal with those both in terms of terminology and language, but also how we translate them into data points that you can understand.
And then as Flatiron, we can't just do this internally. We need to be able to develop an approach that other companies who are curating data can use, because if it's not transparent then it's actually not gonna be something people can trust. We gotta make sure that it's portable and replicable. And we gotta make sure it's scalable: if we can only do it on 30 patients, that's not gonna get us very far; we need to be able to do it on 300,000 patients. When we think about the radiology reports and the clinical visit notes and how this ultimately all comes together to develop a progression endpoint, you've seen across the day this concept of a patient journey and the events that happen to a patient across time, and each of these events has associated documents.
So for example, this particular data point that's shown here in red, after the patient had progressed, has both a radiology document, in this case a PET scan, as well as a clinical case note that documents overall progressive disease. I promise you it is not always this clear. And so, ultimately, what we need to do is come up with the mechanisms to help abstracters confidently pull this information out of a chart. We did a series of experiments. Literally, we experimented with: do we give them this document first? Do we ask this question first? Should we do it all at once, or do we do it in a set of different sessions so that people don't get tired? We have actually submitted this work for publication, and one of the things we found was that if we anchor the abstracters on what the clinicians say first, they do their work more efficiently and more consistently, and ultimately what we get is more reliable results. Again, using that same concept of the validation framework, we can now look at the metrics around progression.
So, one of the metrics that we look at is the feasibility and quality of abstraction, that's the reliability metric, and the completeness of the collected data. One aspect, for example, is whether or not patients who received second-line therapy received it in the context of a prior progression event, because we would expect most patients to have had a progression event before they went on second line, although some patients may have gone on to second-line therapy for toxicity or other reasons. We can go back, look in the chart, and read the case notes to understand why they really went onto second line, to increase our confidence in the results. These are the kinds of metrics we develop that we can then hand to you, so that we can have a conversation about what this data point means.
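A sketch of that completeness check might look like the following; the record layout is a hypothetical one chosen just to illustrate the metric.

```python
from datetime import date


def share_with_prior_progression(patients: list[dict]) -> float:
    """Among patients who started second-line therapy, the fraction with a
    documented progression event on or before the second-line start date.

    Each patient dict is assumed (hypothetically) to carry:
      "second_line_start": a date, or None if no second line was given
      "progression_dates": a list of documented progression dates
    """
    second_line = [p for p in patients if p.get("second_line_start") is not None]
    if not second_line:
        return float("nan")
    with_prior = sum(
        any(d <= p["second_line_start"] for d in p.get("progression_dates", []))
        for p in second_line
    )
    return with_prior / len(second_line)
```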
We look at inter-abstracter agreement: whether or not abstracters agree that a progression event happened, which we see about 94% of the time, and whether they agree on the date. Do they agree on the date within a specific range, like within 15 days or within 30 days? And you see that the 30-day date agreement for non-small cell lung cancer is 85%. Similarly, we need to look at validity of outputs. This is that construct validity that you were hearing about from Sean. One example of this is the likelihood of a progression event being associated with a downstream event, like treatment change, going to hospice, or mortality, and that's happening about 65% of the time. And when we don't see that downstream event, we go back and look at the charts and say, what in the world is going on, and why might we not see an event happening within this period of time?
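Those inter-abstracter metrics could be computed along these lines; this is a simplified sketch in which each patient is represented by the two abstracters' progression dates, with None meaning that abstracter found no event.

```python
from datetime import date
from typing import Optional, Tuple


def abstracter_agreement(pairs: list[Tuple[Optional[date], Optional[date]]],
                         window_days: int = 30) -> dict[str, float]:
    """Inter-abstracter agreement on progression.

    Reports how often two abstracters agree that an event occurred at all,
    and, among charts where both found an event, how often their dates fall
    within `window_days` of each other.
    """
    event_agreement = sum((a is None) == (b is None) for a, b in pairs) / len(pairs)
    both = [(a, b) for a, b in pairs if a is not None and b is not None]
    date_agreement = (
        sum(abs((a - b).days) <= window_days for a, b in both) / len(both)
        if both else float("nan")
    )
    return {"event_agreement": event_agreement,
            f"date_agreement_within_{window_days}d": date_agreement}
```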
We can also then take a look at how these same metrics line up disease by disease. We can go back and try to figure out, can we improve our policies and procedures or abstraction processes to enhance these metrics? And we can also think about whether there are other datasets we need to bring in to make this better across time, for example, imaging data, as you've heard about. And so, what you see us doing at Flatiron right now is using lung cancer as our primary benchmark, or starting point, for developing progression endpoints. We've now developed them across prostate, breast, melanoma, small cell, and renal cell, and we continue to work on three more diseases in 2019. But the next huge task, which Mitch is gonna tell you about, is moving from developing these endpoints and developing this validation documentation to putting the endpoints into a series of experiments and understanding how they're performing in those experiments. We do experiments inside of Flatiron, but the most important experiments are done by our partners and all of the community that's here in this room, and one of those partners is Mitch Higashi from BMS. Thank you very much.
Mitch Higashi: Thank you very much. It is a great honor to be here, thanks so much. Last month, October 2018, the Nobel Prize in Medicine was awarded to two cancer researchers, right? Dr. Allison at MD Anderson here in the U.S. and Dr. Honjo in Japan, for their breakthrough work in immunotherapy: understanding how the immune system is activated and works against cancer cells. The Nobel Committee called this, quote, “an entirely new principle in how we treat cancer”. Why am I saying this? We are all now connected to this new principle in how we treat cancer, okay? Real-world evidence will feature prominently and critically in how we all go forward. All of us from different disciplines, all of us as different stakeholders, working together with real-world data to make advances in cancer.
So our story began with YERVOY in 2011, our first approval in immunotherapy for advanced melanoma. YERVOY is a CTLA-4 inhibitor, and our story continued in 2014 with the approval of NIVO, or OPDIVO, also in advanced melanoma. And so, the idea here is that one of the ways cancer allows itself to grow is by exploiting the braking system, right? So if you think about a T-cell, what cancer wants to do is activate that braking system, okay? It wants to engage CTLA-4, the brake that prevents the proliferation of T-cells, and it wants to engage the braking system of PD-1, preventing the immune system from recognizing the tumor cells and attacking them, okay? So, after 2014 we had a pretty simple working theory, which was: combine both drugs. Combine both drugs in our first and deepest area of expertise, which was advanced melanoma. And the idea was, shut down the tumor's ability to engage these braking systems and basically make it harder and harder for the tumor to evade. Make it harder and harder for the tumor to activate its cloaking device.
Now, in our opening remarks from Zach, he talked about our exponential growth in understanding in this field. Basically, we are doubling our knowledge, doubling our understanding, every year now. What that also means is that last year, we knew about half as much as we know this year. So, heading into ESMO 2017, our story begins with an interim result. This is the presentation of at least three years of follow-up, meaning that every patient in this trial has at least three years of follow-up for overall survival. And so, you can see here on the left hand side of the slide, promising, right? Promising. You see a numerical trend. You see a signal suggesting that the IO combination has a more durable response than either NIVO, that's OPDIVO monotherapy, or IPI, which is YERVOY monotherapy. We don't have statistical significance, but it is suggestive.
However, on the right hand side of the slide, you now see a new question. And this was a big question for us. What you see is that, in the U.S. population versus the European trial results, there is a statistically significant difference in overall survival. And we are immediately faced with a question: okay, what is going on? Can we validate this in real-world data? We have to try to validate it because we need to develop new working theories. Why is it that the U.S. population could respond better in a randomized trial versus the European population? So we immediately turned to Flatiron. I won't go into great detail on this slide, Amy did a wonderful job taking you through the methods, but here is the reason why we were confident in partnering with Flatiron, both for overall survival built on a real-world mortality endpoint and also for progression-free survival.
The way I would summarize it in one word is transparency. Transparency: very honest, open, and transparent with us about, look, here's where we are with version 2.0, it's a work in progress but this is what we've got so far, we're using the SSDI, we're augmenting it with obituaries, we're using algorithms to link it to the EMR, and we're trying to validate this against the National Death Index. And similarly, with progression-free survival, Amy alluded briefly to the Health Affairs paper. Both Sean and Amy are co-authors on this paper, which talks very openly about their methods for inter-rater reliability and intra-rater reliability. So, inter-rater reliability: two different abstracters can look at the same chart and come to the same conclusion. Intra-rater, meaning that one abstracter can look at that same chart at different points in time and come to the same conclusion.
So, heading into ESMO 2018, this is what we were able to present from the Flatiron study. The first thing that you see here is that we have raised the bar, because with Flatiron we are now able to track and evolve our understanding of the standard of care. So we actually wanted to benchmark against OPDIVO monotherapy, right, because that was becoming the new, if you will, predominant standard of care. What you see here is that the IO-IO combination is the most durable of the response curves. Similarly, what you see here with overall survival is that overall survival for the IO-IO combination appears more durable than for NIVO monotherapy.
Now, perhaps most importantly, I'll say two things. What we have seen most recently this year at ESMO from the report out of the randomized trial, this is the extension of that three-year follow-up into four-year follow-up, is essentially the same thing: the IO-IO combination produces the most durable OS response. And that appears to be supported here in the Flatiron study. Perhaps more important for us, as outcomes researchers, as explorers of real-world evidence, what we were able to see was that the European trial population had a more advanced or higher burden of disease, so they had a higher proportion of distant metastasis. That proportion of distant metastasis was higher than in the U.S. trial population and in the U.S. Flatiron population. So, we now have a new understanding, a new approach, a new way to think about burden of disease and how patients are going to progress. So with that, I believe that is my last slide, and we have some time to call up the panel. I'd just like to acknowledge all of our collaborators, both at Flatiron and our co-authors, for this important work. Thank you.
Aracelis Torres: Certainly the timeliest set of presentations I've ever seen. So, thank you everyone. We'll kick things off with our moderated panel discussion, reflecting in particular on all the talks we have just heard. We've seen there's been a broad increase in the uptake of real-world data, and in turn also of real-world endpoints and their utilization for specific use cases. So, hopefully from this discussion we'll get a better understanding of how we begin to establish quality standards with respect to endpoints, standards that may vary by use case. And also discuss where we go from here, now that we've laid out where the frontier is, present day, and what lies ahead.
So, I'll pose the first question to Sean. So, we saw in Amy's presentation a number of metrics shared, especially with respect to performance metrics for small-cell lung cancer and advanced non-small cell lung cancer. Curious as to your reaction and thoughts upon seeing some of those metrics across a number of domains.
Sean Khozin: Well, you know, there are some metrics that obviously are more difficult to measure and require analytical validity, but it's important to recognize that with any metric, including how we assess RECIST tumor responses in traditional clinical trials, there's a lot of volatility, and that's something that we have intuitively recognized. For example, if you look at phase three registrational studies that incorporate tumor-based endpoints, progression-free survival, for example, we require an independent radiology review assessment. The reason for that is that we'd like to have a second opinion, a second look at the images. And over the years we knew that there was a discordance, i.e. volatility, in the estimation of tumor response, and more recently we did a meta-analysis, and that volatility, or discordance, is about 35%.
So these are registrational studies, highly controlled, with highly trained professional radiologists, and two radiologists looking at the same image come up with two different assessments. And that's after categorization into RECIST, which has a 50% margin of error already built in: you know, we don't call anything a response unless a tumor shrinks more than 30% from baseline, and nothing is progression unless it grows more than 20%, so that's a 50% margin of error built in because of human visual inspection. If we look at how radiologists actually measure the longest diameter of the lesions, that discordance is much higher.
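For reference, the thresholds Sean is describing come from RECIST 1.1. Here is a simplified sketch of the target-lesion categorization; real RECIST assessment also accounts for non-target lesions, new lesions, and lymph-node rules, and measures progression from the nadir rather than from baseline.

```python
def recist_target_lesion_category(baseline_sum_mm: float,
                                  nadir_sum_mm: float,
                                  current_sum_mm: float) -> str:
    """Simplified RECIST 1.1 categorization of target-lesion measurements.

    PR: at least a 30% decrease in the sum of diameters from baseline.
    PD: at least a 20% increase from the nadir (smallest prior sum) plus an
        absolute increase of at least 5 mm.
    CR: all target lesions gone. Everything else is SD (stable disease).
    """
    if current_sum_mm == 0:
        return "CR"  # complete response
    growth = current_sum_mm - nadir_sum_mm
    if nadir_sum_mm > 0 and growth / nadir_sum_mm >= 0.20 and growth >= 5:
        return "PD"  # progressive disease
    if baseline_sum_mm > 0 and (baseline_sum_mm - current_sum_mm) / baseline_sum_mm >= 0.30:
        return "PR"  # partial response
    return "SD"      # stable disease
```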
So, the moral of the story is that, there's a lot of volatility in how we assess what we believe to be the gold standard. Whether that's clinically meaningful or not, that's a separate discussion, and I think we can also, we can learn from that foundation when we think about real-world endpoints and how to think about the metrics and how to assess what is really good enough, if you will.
Amy Abernethy: You know, I just want to comment on what Sean just said. I think, as you get to the moral of the story, the moral of the story is, it's all messy, right? And acknowledging it's messy is kind of the first part of the task, and then figuring out how we're gonna be transparent about the messiness and work our way through it is important. One of the things that strikes me about RECIST is that it's a consistent framework, a consistent approach to deal with the messiness. And similarly, I think with real-world endpoints, what we need is a consistent framework that allows us to work our way through the practical reality that this is a messy area.
Aracelis Torres: And Mitch, I guess from your perspective, as you're thinking through potential use cases of how to leverage a real world dataset for some of the analysis you showed. How are you assessing the trade off of how good is good enough?
Mitch Higashi: Sure, so, you know, I showed this example from melanoma, how we have reasonable confidence in overall survival, and that study is one case; there's a lot of great work going on with Flatiron. So, Aracelis, she's doing work in lung cancer to look at the effect of censoring and potential bias in overall survival and how that could change the hazard ratio. Aracelis and the team are pioneering some methods there. Carrie Bennette and the Flatiron team are looking at ways to use real-world data and statistical modeling to essentially construct an external control arm for measuring and estimating overall survival.
So all of this stuff together, combined, is, if you will, the collection of evidence and methods that's coming together to give us a lot of confidence with overall survival. Progression-free survival, I see it again as this inter-rater and intra-rater reliability that gives us confidence there. But I also agree with a lot of Sean's comments about RECIST. Craig, in the previous session, talked about quantitative RECIST, and I think there's something there for us to explore, to get better at quantitative measures that essentially score and add quantitative information to what is effectively a measure of measurable disease.
Aracelis Torres: And I know, Mitch, in your particular example, you talked through how reproducibility, replicability, is one potential signifier or qualifier of whether this is potentially good enough. Are there other thoughts, either Mitch or Amy, on additional signifiers of data quality with respect to real-world endpoints?
Amy Abernethy: Do you want to start Mitch?
Mitch Higashi: Sure. I can go first. Look, I think we're getting to a place where we're going to see mortality as a domain and mortality surrogates to help validate that domain and I also think a second one is patient reported outcomes, right? The idea that a patient feels less sick and can schedule their work hours with more certainty is not to be taken lightly. I think that deserves its own consideration as a domain and I think more and more surrogates need to come into play to validate that domain.
Amy Abernethy: You know, I'll add to some of what Mitch was just saying; a couple of things come to mind. First of all, the ability to bring together datasets. Mitch, you just mentioned, for example, patient-reported outcomes, and I think we need to think broadly. For example, one of the things we need to think about how to bring in is claims data. So really bringing together multiple datasets, that's the first part. The second part then becomes, we need to think about essentially portfolios or packages of endpoints, because in order to tell the complete story you need to be able to see it from all sides. You need to be able to see what's the impact on disease burden and mortality, but you also need to be able to understand what's the balance in terms of toxicity and safety, what's the balance in patient experience, and frankly, what's the balance on our economy and healthcare system as a whole, so health resource utilization outcomes are very important there.
The last thing I'll say, and I've been sort of thinking about this the entire session, and based on Sean's talk, is triangulation. You know, our ability to understand how valid an endpoint itself is can be triangulated from multiple areas. And then also, having endpoints that tell the same story from multiple angles becomes another part of what we need to be thinking about in the future.
Aracelis Torres: Any additional thoughts Sean?
Sean Khozin: I agree with all the statements. And I'd like to underscore what was articulated: that we can have this very logical approach to how we look at endpoints in the real world, starting with concepts that we're comfortable with, progression-free survival, overall survival, and then moving into patient-reported outcomes. And as we all know, there are many emerging ways of collecting patient experience data using sensors and wearables, smart watches.
I've recently discovered, after getting this gadget, that my resting heart rate is typically very low, it's about 50, and obviously that's not bradycardia, that's my n-of-1 normal. And we can all envision scenarios, those of us who have been in situations in a clinical setting where we're monitoring a patient in, say, the ICU: the heart rate is low, you're puzzled, should I re-dose the beta blocker? That could be that patient's n-of-1 normal. So to be able to go beyond what's been feasible and try to capture patient experience data using these emerging modalities, that's something that has a lot of value and can open up a whole new realm of opportunities and possibilities. And a phased approach, you know, starting with what we know, what we're comfortable with, all the way to sensors and wearables, really spells out a very exciting path forward.
Aracelis Torres: And I've also heard this trend about the importance of being transparent about what's under the hood, as noted again in the keynote. One of the challenges is that the publication cycle itself is very long, while the use cases and the questions that sponsors and users want answered tend to be, as of today: how can I use this, what are the metrics you can share? Is there any sort of brainstorming we can do, or ways to think about how to get the information out there while the publication cycle tries to catch up? Obviously, we have the mortality publication, something we could easily reference, but the progression paper is still making its way through that peer-review process.
Mitch Higashi: Well, I'll go first. I think this forum is an excellent example of bringing us together, right, to get a better understanding of the evolving methods and how we can apply them. And I'll say, look, you know, we're committed to publishing our research and we're on a publication path, but there has to be a way for us to understand how the methods are evolving and how to apply them, and this is one great forum to do it.
Aracelis Torres: Any thoughts, Amy or Sean, to add?
Sean Khozin: Somebody should start a new journal, Journal of Real World Evidence.
Mitch Higashi: Editor in Chief.
Amy Abernethy: I think one other thing, since I'm going to use this as my forum to push one of my things forward: I think we need versioning. I think we need to be able to see it on the web and in other places where you can clearly see, with documentation, this is the version and this is the information that goes along with it. We try to figure out how to do this at Flatiron, but we still live in a publication-focused world, and figuring out how we can have confidence in the information that we see on the web, and all see it in a transparent way, I think is one way to get there.
Aracelis Torres: And we'll end on one last question, where do we go from here? What does the future of real-world endpoints look like, knowing what we know now? I guess we'll start with Amy, then Mitch and end with Sean.
Amy Abernethy: I think that, ultimately, we need to get to a place where we are continuously thinking about which endpoints tell the full story, but where we have confidence in the endpoints that are in our datasets, so that we're not really spending our time trying to figure out how to develop endpoints and are instead thinking about what the results are and what story they tell us. So I think we're in a developmental step right now.
Mitch Higashi: Yeah, I think the surrogate space is where we'll see more innovation and more disruption, right? And as these ideas evolve and more consensus builds around these different measures: real-world treatment response, does that need to be validated against RECIST? Is this a new type of endpoint? Where does it fall in the hierarchy? I think these are some of the questions we have.
Aracelis Torres: Sean?
Sean Khozin: I'll say, don't be afraid to try new things, bold and adventuresome things, and consider the FDA a friend, because we like to figure out ways of understanding the patient experience better and more holistically using a variety of different data types, and turning the focus towards the real world, you know, where the majority of adult cancer patients are being treated.
Amy Abernethy: I just want to say something about that, we've gotten so many great ideas in working together with the FDA, as well as working together with all of you, that I really just wanna, kind of, underscore that working with the FDA as friends has been a really helpful way of moving this forward, so thank you.
Aracelis Torres: Well, I certainly want to thank Amy, Sean, and Mitch. Certainly a lot to think about and move forward on: how do we take the next step with respect to real-world endpoints? This is certainly just the beginning, not the culmination, of anything in any sort of realm of possibility. So thank you again, and hopefully everyone got a lot from the presentations. I will note that some of the panelists will be joining us at Ask the Experts, so please find your way there if there are any lingering questions you have or topics that you want to delve into a little more deeply outside of this presentation. So, thank you everyone for coming and thank you for participating.
Flatiron Research Summit
November 7, 2018
- Aracelis Torres, MPH, PhD — Senior Quantitative Scientist at Flatiron Health
- Amy Abernethy, MD, PhD — Former Chief Medical Officer, Chief Scientific Officer & SVP Oncology at Flatiron Health
- Sean Khozin, MD, MPH — Associate Director, Oncology Center of Excellence at FDA
- Mitch Higashi, PhD — Vice President, Health Economics and Outcomes Research at Bristol-Myers Squibb
- An evaluation of the impact of missing deaths on overall survival analyses of advanced non–small cell lung cancer patients conducted in an electronic health records database
- Development and Validation of a High‐Quality Composite Real‐World Mortality Endpoint
- Leveraging Real-World Evidence for Regulatory Use