The Role of Machine Learning & NLP in Real-World Evidence

September 29, 2018

The increase in availability of real-world evidence has been accompanied by the need to improve data processing technology in order to efficiently draw insights from large amounts of heterogeneous data.

In this session at the 2018 Flatiron Research Summit, Flatiron team members discuss the evolving role of machine learning and natural language processing (NLP) in real-world evidence, including Flatiron's guiding principles for applying these techniques.

Transcript

Josh Haimson: So there's a lot of buzz around AI in healthcare. If you look in the popular press or business reports you'll probably see headlines that suggest robot doctors are ready to take over. And depending on your perspective, you're either rolling your eyes at this, thinking this is just another fad that's going to die out once the next trend comes along. Or you're eagerly awaiting your appointment with doctor robot. The reality, as always, is much more nuanced than the headlines will lead you to believe. So in today's session we're hoping to try and demystify AI a little bit. We really want to try and separate the reality from the hype.

We'll share our perspective on where these technologies are useful in generating real-world evidence. But we'll try and avoid the flashy headlines and instead focus on some concrete examples. Before we get into that I want to start with some definitions, because there are a lot of acronyms in this space, and the terms are often confusing and hard to disentangle. And then we'll spend a little bit of time on the principles that guide all work in this space. And then we'll spend the majority of our time on a few case studies illustrating how this can be used in practice. And then a little bit of time looking forward to the future of the technology in this space. And of course, we'll save some time for a Q&A at the end.

So let's start with definitions. There's three terms that I want to define. The first is artificial intelligence, or AI. The second is machine learning, or ML, and the third is natural language processing, or NLP. We'll start with the broadest of these terms, which is AI. So if you look in a textbook, the definition of AI is the development of computer systems that are able to perform tasks that normally require human intelligence. And one thing you might think when you look at this definition is that it's really broad. You have things like visual perception, speech recognition, decision making, translation. In some ways this is actually inevitable.

Marvin Minsky, who was a prolific AI researcher, and actually helped coin the term AI, would call intelligence a suitcase word. It's a word that we stuff with so much meaning that it's actually impossible to define precisely. So then artificial intelligence, the field of making computer systems intelligent, is equally as broad. It's the field of building computers that can do the thousands of things that we call intelligence and more.

When the field first started they focused on things that sounded intelligent. Things like calculus and chess. And they actually turned out to be a little bit easier than one might have expected. In 1961, James Slagle published a PhD thesis on a program that was able to ace a calculus test. And over 20 years ago, we had Deep Blue beat a chess grandmaster. Many of the hardest problems in artificial intelligence are actually not the things you learn in universities and textbooks. They're the things that we all learn in our first few years of life, from the millions of data points around us.

Things like why, when I move my hands a certain way, I can build a tower of blocks that stands tall, but if I move them in a slightly different way that tower of blocks falls down. Or, more fundamentally, how to even distinguish what a block is based on the patterns of light hitting my eye. So artificial intelligence as a field is focused on trying to teach computers to do all of these tasks, and the thousands more that we stuff into the suitcase of intelligence.

One of the powerful tools for solving many of these artificial intelligence problems is called machine learning, or ML. ML uses statistics to give computers the ability to learn without having to explicitly program them how to behave. And this is really powerful, because many of those hard tasks of artificial intelligence are actually really hard to write instructions for. And computers are very literal systems that only follow the exact instructions you give them.

So in order to be able to solve many of these artificial intelligence challenges, machine learning allows us to skip the step of writing instructions and instead collect many examples of the tasks being performed and use statistics to teach a computer to learn how to perform those tasks.
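To make that concrete, here is a minimal, purely illustrative sketch of learning from labeled examples instead of writing rules; the texts, labels, and library choices below are assumptions for illustration, not anything from the talk:

```python
# Illustrative sketch only: learn "spam vs. not spam" from labeled examples
# rather than hand-written rules. The texts and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",      # spam
    "claim your free reward",    # spam
    "meeting notes attached",    # not spam
    "lunch tomorrow at noon?",   # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# No explicit rules are written: the model learns which word patterns
# distinguish the two classes from the examples themselves.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["free prize waiting"]))  # likely predicts [1], i.e. spam
```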

To make this a little more concrete, think about the task of facial recognition, a task we do thousands of times every day. Imagine trying to write a detailed list of instructions on how to look at an image and determine who exactly is in that image: specifying in excruciating detail what patterns of pixels and colors make up lines and edges, what patterns of lines and edges make up shapes, what patterns of shapes make up faces, and what patterns of shapes and textures within those faces make up the unique face of a specific human under all angles and lighting conditions. It quickly becomes an impossibly long list of instructions that would take forever to write.

However, it's actually really easy to collect thousands of images where people have already identified who is in those images. And using machine learning we can use those examples to build models that can learn how to perform this task and skip that really long list of instructions that you would normally have to write.

The last tool I want to talk about for tackling artificial intelligence problems is called natural language processing, or NLP. NLP is a set of techniques that allow computers to understand and use human language. And this is really important for artificial intelligence, because language is at the core of intelligence in all human activities. And it's really important in our space, in medicine, because the only universal interoperability standard that's really used in practice is language.

So if you look on the right here you see a radiology report, which documents an increase in size and number of left lower lobe pulmonary nodules, or a progression of disease. On the left you have a clinician confirmation that they've interpreted that radiology report and their plan for what they're going to do based on that information. Without natural language processing, we'd have no way for computers to be able to use this language. But with NLP techniques, we can build computer systems that can leverage all of this data.

And one thing you'll notice in this diagram is that machine learning and natural language processing overlap. Machine learning is actually a really powerful tool for helping to teach computers how to use language. And that intersection is actually where we spend a lot of our time. So for the rest of the talk we'll be focusing on that intersection, on how we use machine learning to help computers understand all of the language in the hundreds of millions of documents in our network to build real-world evidence. One other note on terminology before I go on. For the rest of this talk I'll just be talking about machine learning. In reality I'm talking about this intersection, but it gets a little wordy if we include everything.

So I want to spend a little bit of time on some of the principles that guide all of the work that we do in this space. And then we'll jump into concrete examples illustrating these principles. We have three principles. The first is that machine learning will empower humans, not replace them. The second is that machine learning is math, not magic. And the third is that it's a tool, not a product. And I'm going to dive into each of these in a little bit more depth.

So machine learning will empower humans, not replace them. The state of the art in machine learning is really great at a number of specific tasks. Things like classification: looking at an email and determining whether it is or isn't spam. Recommendation: when you're on Netflix and Netflix recommends three other shows you might like based on your previous watch history. Ranking: when you type a query into Google and it gives an ordered list of results in the order that it thinks is most helpful for your question. Image recognition: looking at an image and determining who specifically is in that image, or, in the medical field, looking at an x-ray and determining if a bone has a fracture.

These are things that state-of-the-art machine learning has gotten really good at in recent years. However, there are a number of things that the state of the art just can't do today. Things like synthesizing information across a large number of sources and modalities. So for example, in that progression documentation, synthesizing across the radiology report and the clinician's interpretation of that radiology report. That kind of task is really hard for the state of the art today. Or applying domain-specific knowledge: applying all the years of medical training and experience that many of you in the audience have, and coding that into a computer, is really difficult.

And adapting to novel information. As I mentioned, one of the most common tools in this space is machine learning, which learns how to behave from many examples. But if it comes across a new kind of example, most machine learning models don't really know what to do with it, because they don't have experience with it. Humans, on the other hand, can use common-sense reasoning to figure out how to solve that challenge, or they can just raise their hand and say, "I don't know what this is. Let me go talk to someone who does."

Even if state-of-the-art machine learning gets really good over the next decade or so and starts to be able to do many of these things on the right, humans are always going to be necessary to generate the training data that these models learn from in the first place. And, really importantly, going back to the checklist at the start of today, to evaluate the performance of these models.

So that's why at Flatiron we take the approach of technology-enabled abstraction. We use technology to do the things that technology is great at, but we pair it with expert clinical abstracters, who can apply that unique human expertise to do the abstraction. One piece of this technology is the abstraction tooling itself, like the abstraction lab that you can see outside and experiment with. Another piece of that technology is machine learning, which we'll talk about.

Our second principle is that machine learning is math, not magic. One of my favorite quotes is that all models are wrong, but some are useful. At the end of the day, machine learning models are just mathematical objects, and we should treat them as such, instead of treating them as these magical black boxes that can solve all of our problems. And for us in the real-world evidence space, that means that we need to realize that these models are going to have some amount of error. And so we need to be able to measure those errors over time, and at scale, in all of the things that we do.

We need to be able to reproduce every prediction that's made by every one of our models at any point. That way, if two years from now the FDA or a partner comes back to us with a question on a specific data point, we can trace all the way back to the exact model that made that prediction and what data was fed into it, and understand why that decision was made. And then lastly, we need measurement and processes that are scientifically sound. We need to make sure our data is robust in light of the fact that we're going to introduce these mathematical objects that are going to have some amount of error.

Our third principle is that machine learning is a tool, not a product. Flatiron does not want to be the proverbial hammer looking for a nail. We want to be solving important problems in the industry. We see machine learning as a really powerful tool that helps us scale up our existing products, helps us add new features, and maybe even new kinds of products. But at the end of the day machine learning alone is just a tool for building those. And it's not a product in and of itself. We're not interested in trying to use it just because it's the exciting technology of today.

So those are our three principles. I'll now hand it off to Ben, who's the tech lead of the machine learning team to illustrate what these look like in practice.

Ben Birnbaum: Thanks Josh. I'm going to take us through the first two case studies and then I'll hand it over to Geetu, who will go through the third. First case study is how we used machine learning to scale our core registries using a technique called model-assisted cohort selection, or MACS for short. I'll be talking about this in the context of the metastatic breast cancer core registry. So I want to start by reminding everybody of what that is.

This is a de-identified dataset of more than 17,000 patients. It's refreshed on a monthly basis. Every patient in this registry has been diagnosed with metastatic or recurrent breast cancer after 2011. And the data model includes things like detailed diagnosis information, oral drugs, biomarker status, and endpoints like mortality.

So MACS is a tool to make cohort selection more efficient. So I want to start to explain how MACS works by showing how cohort selection worked for the metastatic breast cancer core registry before we started using the technique. You can think about this as a funnel process. The first step in the funnel is to take all of our patients across our network and filter them down to the ones who have a structured billing code that indicates a diagnosis of breast cancer. Then we take all of those patients and we send them through to abstraction, where abstracters will confirm whether or not the patients have metastatic disease.

And the reason that we need to do this, is that determining whether a patient is metastatic really requires you to go into the unstructured data. After abstraction if the patient's metastatic, that patient ends up in a cohort, and we do further abstraction on that patient to extract out that data model. If the patient is not metastatic, then the patient is removed from the cohort and no further abstraction is done.

So an operational challenge for us here is that the vast majority of patients that we send through to abstraction are not actually metastatic; fewer than 1 in 10 are. So that means that more than 90% of the time that we have our trained experts reading through patient charts to determine whether or not they're metastatic, it's actually not adding value to the dataset that we're creating. To put it another way, to get the 17,000 patients that end up in the cohort using this technique, we have to abstract more than 200,000. So there's clearly a lot of room here for efficiency improvement.

Where MACS comes in is that it acts as an extra filter in this cohort selection funnel. So now, instead of sending every patient through to abstraction to confirm metastatic disease, we first send the patients to a model, and the model outputs whether or not it believes that the patient is metastatic. If the model thinks the patient is metastatic, then everything proceeds as it did before, where an abstractor will still confirm the patient is metastatic before the patient ends up in the cohort. Whereas if the model believes that the patient is not metastatic, then the patient is removed from the cohort and no further abstraction is done.
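As a rough sketch of how that filter slots into the funnel (the model object, threshold, and return values here are hypothetical, not Flatiron's actual implementation):

```python
# Hypothetical sketch of MACS as an extra filter in the cohort selection funnel.
# `model` is any trained classifier with predict_proba over a patient's text.

THRESHOLD = 0.2  # illustrative cutoff, tuned to keep sensitivity high

def route_patient(model, patient_text: str) -> str:
    """Decide whether a patient continues on to human abstraction."""
    p_metastatic = model.predict_proba([patient_text])[0, 1]
    if p_metastatic >= THRESHOLD:
        # The model thinks the patient may be metastatic; an abstractor still
        # confirms before the patient enters the cohort.
        return "send_to_abstraction"
    # The model is confident the patient is not metastatic; no further abstraction.
    return "exclude_from_cohort"
```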

So to launch MACS we went through the steps that we usually do at Flatiron to launch a machine learning use case. The first step is to get labeled data, and we use this both to train the algorithm and to measure its performance. The second step is to develop the model, and then we measure the performance. If the performance is not what we hope, we might iterate and go back and develop the model more. Once the performance is good that's when we launch it, and the final step is to monitor the performance of the model over time.

So I'll walk through how this works in the context of MACS for metastatic breast cancer. Getting labeled data in this case was actually pretty straightforward. What we needed was a map of patients to whether or not they were metastatic. And we had this from all the abstraction that we'd already done. To develop the algorithm, our intuition was that there should be short phrases in the patient's chart that are predictive of whether or not the patient is metastatic.

So if a patient is metastatic, you might see phrases in the chart like “stage IV breast cancer”, “proven metastatic”, or “bone mets”. Whereas if a patient is not metastatic, you might see phrases like “no evidence of metastatic disease”. By themselves these phrases aren't conclusive, but the idea is that if you see enough of one type or another you can make a prediction in a statistical sense.

So I want to actually walk you through the algorithm that we used to build a model to take advantage of this intuition. I'm going to preface this by saying that this is probably the most technical slide here. So if there are a lot of new concepts or you have trouble following along for whatever reason, we're going to get back to the bigger picture stuff after this.

The first step in the algorithm is to search for a list of relevant terms. And the reason that we need to do this is that associated with any given patient is a huge number of documents. Sometimes more than a hundred. These are documents like clinic notes, pathology reports, radiology reports, and very little of the information in those documents is actually relevant for determining whether or not a patient is metastatic. So by starting with a search, we can direct the model's attention to the portions that are most relevant for determining metastatic status.

Here are four example terms. In practice we tend to use more like a couple of dozen, and we come up with these terms both through experimentation and through consultation with our clinical teams. The next step is to extract snippets of text around these search terms. So here's a made up clinic note. And you can see that there are two search hits for these search terms, one for the word stage, and one for the word mets. We then go a few words to the left and a few words to the right to extract out a snippet. In practice, we go multiple words to the left and multiple words to the right. Here we're just doing one to the left and one to the right to keep things simple.

And then the next step is to do what's called bag-of-words feature extraction for the phrases that we see in those snippets. What this does is allow us to map the unstructured data into a vector representation that can be used as input to a machine learning model. What that means is that for each patient we have a vector, and every row in that vector corresponds to a phrase that may or may not appear for that given patient. If the phrase does appear, then the value for that row is one. And if a phrase doesn't appear, the value is zero.

So here in this first snippet there are two two-word phrases. And I also want to mention that typically we do this for phrases of multiple lengths, but we're just sticking to two-word phrases for simplicity here. So here there are two two-word phrases, including “stage IV”, and the rows corresponding to those phrases would both have a value of one. In the second snippet, there are two additional two-word phrases, “bone mets” and “mets were”, and so those would also have a value of one. Finally, for phrases that occur for some other patient but that we don't see for this patient, we'd have a value of zero in those rows.

So the final step is to take these feature vectors across all of our training data, along with the label for whether or not each patient is metastatic, and feed that as input to a machine learning algorithm. In general our approach is to choose the simplest algorithm that performs well. For this particular use case we found that that was something called regularized logistic regression. For other use cases we've found other algorithms that worked well: sometimes we've used random forests, sometimes things like recurrent neural networks if it's a sequential problem. And this is not the only type of feature extraction we do, I just want to mention that. Sometimes we use structured features if those are relevant. Sometimes we use different types of weighting, like TF-IDF, if you know what that is. But for this particular use case, this is what we found worked well.
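Here is a rough, self-contained sketch of the kind of pipeline just described: search terms, snippet extraction, binary bag-of-words features, and regularized logistic regression. The terms, window size, example notes, and model settings are illustrative assumptions, not Flatiron's actual configuration:

```python
# Simplified sketch of the snippet + bag-of-words approach described above.
# Search terms, window size, example notes, and model settings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

SEARCH_TERMS = ["stage", "mets", "metastatic"]  # example terms only
WINDOW = 5  # words of context on each side of a search hit

def extract_snippets(document: str) -> str:
    """Pull small windows of text around each search-term hit."""
    tokens = document.lower().split()
    snippets = []
    for i, token in enumerate(tokens):
        if any(term in token for term in SEARCH_TERMS):
            snippets.append(" ".join(tokens[max(0, i - WINDOW):i + WINDOW + 1]))
    return " ".join(snippets)

# One concatenated text blob per patient, plus a label: 1 = metastatic, 0 = not.
patient_docs = [
    "... interval increase in stage iv disease with new bone mets ...",
    "... stage i breast cancer, no evidence of metastatic disease ...",
]
labels = [1, 0]

snippet_texts = [extract_snippets(doc) for doc in patient_docs]

# Binary bag-of-words over one- and two-word phrases, fed to a regularized
# logistic regression (L2 penalty).
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(snippet_texts)
model = LogisticRegression(penalty="l2", C=1.0)
model.fit(X, labels)

# "Peeking into the algorithm": each learned weight is tied to a phrase, so the
# most positive and most negative weights show which phrases drive predictions.
phrases = vectorizer.get_feature_names_out()
weights = model.coef_[0]
```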

And what we found specifically was that we were able to reduce the abstraction burden by almost 60%, while maintaining a 98% sensitivity. And sort of beyond these raw numbers, we're also able to peek into the algorithm, and see which phrases it determined were most relevant for determining metastatic status. And these passed a sniff test for us. So phrases like “stage IV”, “suspicious for metastatic” and “palliative” were all picked out by the algorithm as being predictive of being metastatic. Whereas phrases like “stage I” were picked out as being predictive of being not metastatic.

So this is great. We had an algorithm. We could significantly reduce abstraction burden with it. And we had a really low false-negative rate: we were missing fewer than 2% of patients who we shouldn't have missed. But that doesn't mean that using this came without risks. In particular, if those 2% of patients all share the same rare characteristic (imagine they all have the same rare biomarker or the same rare treatment), then analyses performed on the dataset that depend on those characteristics could be biased. So to mitigate this, we did what we call bias analysis. To do this we start with a random sample of our labeled data that's distinct from the labeled data that we trained on. We take the subset of that data that is in the cohort, that abstracters have said is metastatic. Then we take the subset of that data that would have been picked out by the algorithm.

The first subset of data we call the reference standard, and this represents what would be produced if we hadn't used MACS. And the second subset of data is what we call the MACS cohort, which is what is produced when we do use MACS. We then compare these two cohorts across a number of clinical, demographic, and outcome variables to see if there are any differences.

So here's an example comparison. And this is looking in particular at the first line therapy class. And what we see is what we would hope to see, which is that across both the MACS cohort and the reference standard, the distribution looks very similar. We also do this for continuous variables. So this is looking at the date of metastatic diagnosis. And again we see what we'd hoped to see, which is that the distributions across the two cohorts look similar. And it's actually hard to see, but these are actually two separate distributions, even though they overlap so much.

The final thing we do is we actually replicate analyses that are representative of the types of analyses that would be performed on the dataset. So here we're looking at the difference in overall survival between patients who are triple negative and patients who are HR positive and HER2 positive. And we perform this analysis both on the MACS cohort and the reference standard.

And we see what we'd hoped to see, which is that the result of that analysis doesn't depend on whether or not we're using MACS. And again, there's actually four Kaplan–Meier curves here, but they overlap so much that it just looks like two. So those were three examples. We do this for a number of different variables, many of which are listed here. And what we really want to see before we're ready to launch something like MACS is that we don't see any differences here, that there's little risk of bias.
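For illustration, here's a minimal sketch of what a comparison like that might look like in code, assuming hypothetical pandas DataFrames and column names; the actual bias analysis covers many more variables:

```python
# Hypothetical sketch of a bias analysis: compare the MACS cohort against the
# reference standard on one categorical variable and replicate a survival analysis.
# DataFrame and column names are assumptions made for illustration.
import pandas as pd
from lifelines import KaplanMeierFitter

def compare_categorical(reference: pd.DataFrame, macs: pd.DataFrame, column: str) -> pd.DataFrame:
    """Side-by-side distribution of a categorical variable (e.g. first-line therapy class)."""
    return pd.DataFrame({
        "reference_standard": reference[column].value_counts(normalize=True),
        "macs_cohort": macs[column].value_counts(normalize=True),
    })

def overall_survival(cohort: pd.DataFrame, label: str) -> KaplanMeierFitter:
    """Kaplan-Meier overall survival; expects survival_months and death_observed columns."""
    kmf = KaplanMeierFitter()
    kmf.fit(cohort["survival_months"], event_observed=cohort["death_observed"], label=label)
    return kmf

# Replicating the same analysis on both cohorts; the curves should overlap if
# MACS introduces little bias.
# ax = overall_survival(reference_standard, "reference standard").plot_survival_function()
# overall_survival(macs_cohort, "MACS cohort").plot_survival_function(ax=ax)
```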

So for metastatic breast cancer we did this, and we were ready to launch in July of 2017. But just because the algorithm performed well in July of 2017 doesn't mean that it's continuing to perform well today. And that's because our data is always changing. So treatment patterns change, documentation patterns change, our network is growing, and so the network itself is changing. So we've really invested a lot of effort on the team in building out infrastructure to monitor the quality of our models over time.

How this works is that we continue to abstract a random fraction of new patients coming in regardless of what the algorithm says. And we use that as fresh unbiased test data, and every week we run our models against this test data to see if there are any regressions in the types of analyses that I showed you. If there are, we're notified immediately, the model is pulled from production and we actually need to address that before using it in practice again.
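A simplified sketch of the shape of that weekly check, with made-up names and thresholds; the real monitoring infrastructure is more involved:

```python
# Hypothetical weekly monitoring check: score the production model on freshly
# abstracted (unbiased) patients and flag it if sensitivity regresses.
MIN_SENSITIVITY = 0.98  # illustrative bar

def passes_weekly_check(model, fresh_texts, fresh_labels) -> bool:
    """Return True if the model still meets the sensitivity bar on new data."""
    predictions = model.predict(fresh_texts)
    true_positives = sum(1 for p, y in zip(predictions, fresh_labels) if p == 1 and y == 1)
    actual_positives = sum(1 for y in fresh_labels if y == 1)
    sensitivity = true_positives / actual_positives if actual_positives else 1.0
    # If this returns False, the model would be pulled from production and the
    # regression investigated before it is used again.
    return sensitivity >= MIN_SENSITIVITY
```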

So I want to conclude this case study by mentioning that even though I was focused on metastatic breast cancer, this is something that we've done across a number of our core registries. Right now we're up to nine, and you can see the sensitivity and savings numbers for all of the registries here. In total we've saved more than 18,000 hours of abstraction time since we launched these use cases. And if you do the math, what that corresponds to is a single abstracter working nonstop, around the clock, for more than two years. So we're really excited about this, primarily because it means that we can redirect these resources towards things like making our datasets bigger and towards abstracting complex data points.

So for the next case study I want to talk about how we used machine learning to efficiently add BRAF to the non-small cell core registry. The use case here is a little bit different, but you'll see that the techniques and actually the overall approach are pretty similar to what I just talked about.

The non-small cell core registry has more than 50,000 patients, and it's also refreshed on a monthly basis. Here patients need to have been diagnosed with advanced non-small cell lung cancer after 2011, and the data model is pretty similar to the metastatic breast cancer core registry, in that it includes things like detailed diagnosis information, oral drugs, biomarker status, and endpoints.

The non-small cell core registry is one of our oldest. We launched it in 2014, and a challenge for us has been keeping the data model of the registry up to date as the standard of care in non-small cell has changed. One of the first times that we saw this was in September of 2015, when we added PD-L1 testing to the data model. And to do that we had to re-abstract everybody who was in the registry, which at the time was 14,000 patients. We saw that again in March of 2016, when we added KRAS and ROS1. By then there were 23,000 patients in the cohort, and we had to re-abstract all of them.

This problem really came to a head for us in May of this year, when we wanted to add BRAF testing to non-small cell. At the time there were 48,000 patients in the registry, and so the first thing we asked ourselves was whether we could do what we did before and re-abstract everybody to get their BRAF testing status. This was already really hard when there were 23,000 patients, so now that there were more than double that, we thought it was pretty much infeasible.

So we tried to think about other approaches, and our first step here was also to get labeled data. Here what we needed was a mapping of patients to their BRAF testing status. So what we did was take a random sample of the patients in our non-small cell core registry and abstract all of them for whether they were BRAF tested. The first approach that we tried actually did not use machine learning. The reason that we wanted to add BRAF testing to the data model was because of an approval that happened in 2017. So the first thing we thought about was, maybe we could abstract only patients who were diagnosed after 2017, because that's probably where most of the testing occurs.

So we looked at the performance of this approach on our labeled data and we saw that on the plus side, it would allow us to reduce our abstraction from 48,000 patients to 8,000 patients. But on the negative side it actually had very low sensitivity. So what we saw was that more than 50% of the testing in our test set actually occurred for patients who were diagnosed prior to 2017, so this didn't seem like a feasible approach.
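A rough sketch of how one might check a rule like that against labeled data (the DataFrame and column names are assumptions, not Flatiron's actual schema):

```python
# Illustrative check of the non-ML baseline ("only abstract patients diagnosed
# after 2017") against labeled data. Column names are assumptions.
import pandas as pd

def evaluate_date_rule(labeled: pd.DataFrame) -> dict:
    """Expects a diagnosis_year column and a boolean braf_tested column."""
    selected = labeled["diagnosis_year"] >= 2017  # illustrative cutoff at the approval year
    tested = labeled["braf_tested"]
    return {
        # Fraction of truly tested patients the rule would still capture.
        "sensitivity": (selected & tested).sum() / tested.sum(),
        # Fraction of charts the rule lets us skip abstracting.
        "abstraction_reduction": 1 - selected.mean(),
    }
```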

We next turned to machine learning, and here our intuition was similar to before, which is that there are likely short phrases in the patients' unstructured data that are correlated with whether or not the patient was BRAF tested. So here, if a patient is BRAF tested you might see phrases that indicate the result of that test, like “BRAF positive” or “no BRAF alteration”. Or you might also see phrases that have to do with the lab that was used for BRAF testing, so you might see something like “foundation results show”. If a patient was not BRAF tested, you might see something like “never BRAF tested”.

So we built a model similar to what I showed you in the last case study and we measured its performance on our test data. And we saw that we could reduce the abstraction burden from 48,000 to 20,000 patients, while still maintaining 98% sensitivity. So while 20,000 patients is still a lot, it was now in the realm of feasibility for us, so we were pretty happy with this.

We launched this in June of this year, and because this model is being used for new patients as they enter our network, we had to monitor the quality of this model over time as well. This is the actual plot of the sensitivity of the BRAF model since it was launched. The axes are a little bit funny here, but you can see that it's really only oscillating between 98 and 100%, so fortunately the performance is staying quite high as we use it.

So to sum up, since we launched this model we were able to save 5,000 hours of abstraction just from this use case alone. Because we used it we were able to turn around this project in a month, which was really exciting for us. And because of this we think it will be a critical tool for us as we keep all of our core registries up to date as the standard of care changes.

So with that I'll hand it off to my colleague Geetu. Geetu is the data insights lead, so she does a lot of data-driven product development, and she's going to take us through our third case study.

Geetu Ambwani: Thanks, Ben. So as Ben talked about in the previous two case studies, we've had tremendous success using machine learning to scale up our broad disease-based cohort registries. Now we're going to switch gears a little bit and talk about a different use case that we've seen for machine learning at Flatiron, and that's enabling the creation of pan-tumor biomarker-based cohorts.

And when I say pan-tumor biomarker-based cohorts, what we mean is trying to identify a cohort of patients that are positive for a certain biomarker indication, regardless of the tumor type that the patient may actually have. As we move towards the world of personalized medicine, we've seen a meaningful shift in oncology away from tissue-specific therapies towards therapies that target a tumor's genomic profile, regardless of the location of the tumor. One example we've seen in this space is Keytruda, which received the FDA's first tumor-agnostic approval last year. Loxo also has a TRK inhibitor drug under priority review with the FDA. And to answer the critical research questions surrounding these drugs, researchers are interested in looking at real-world datasets that are pan-tumor in nature.

One big challenge for biomarker-targeted drug development is that these biomarkers are rare but widely distributed. This means that even though they appear across a large variety of tumors, the prevalence of the biomarker in any single tumor type is very low. Our case study today is an active project, where we're working closely with a life science partner to identify patients that are positive for a certain rare target biomarker across our entire two-million-patient network.

In order to find the patients with this positive biomarker status, we would have to go into the patient chart and look for it, because there's not necessarily a structured field in the EHR that captures this information. So these are a couple of examples of how you would see biomarker status show up in unstructured data. On the left there's an oncologist's clinical note from a non-small cell lung cancer registry, and on the right is an example of what an NGS report would look like, with the genomic alterations of a patient listed.

So as I mentioned earlier, this target biomarker that we are interested in is broadly distributed but still rare, and so identifying such a cohort comes with a big set of challenges. Let's do some rough estimates to see why this is the case. Say we want to find about 50 of these patients. We know that the biomarker is rare, so we would expect only 1 in 100 tested patients to actually be positive for it. That would mean abstracting about 5,000 tested patients to find 50 of these patients. But that's not enough, because a patient may not actually be tested for this biomarker in a pan-tumor setting in the first place.

So let's assume that 4 in 100 patients are actually tested for this biomarker. That would mean abstracting around 125,000 patient charts in order to find 50 of these patients, which is prohibitively expensive, and we would probably never do that. This is the kind of classic needle-in-a-haystack problem, though, that machine learning really excels at addressing.
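For reference, the back-of-the-envelope arithmetic behind those numbers:

```python
# Back-of-the-envelope estimate from the talk.
target_patients = 50      # positive patients we want to find
positivity_rate = 0.01    # ~1 in 100 tested patients are positive
testing_rate = 0.04       # ~4 in 100 patients are tested at all

tested_charts_needed = target_patients / positivity_rate   # 5,000 tested patients
total_charts_needed = tested_charts_needed / testing_rate  # 125,000 charts overall
```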

So the great thing at Flatiron that helped us do this project is that we already abstract the more standard-of-care biomarker information for our core registries. For instance, in our colorectal registry we abstract for KRAS, NRAS, PRAS, BRAF. In non-small cell lung cancer we abstract for EGFR, ROS1, KRAS and a bunch of others. To date we have abstracted biomarker information for 82,000 patients across 11 solid tumor core registries. And so we have a variety of biomarker data across different tumor types and across different biomarkers.

And the key intuition behind the approach that we used was that there's a lot of common phraseology used by oncologists when documenting biomarker status. There are only so many different ways in which you can report a patient's biomarker status, and the information is really present in the 5-to-10-word string that surrounds the biomarker mention.

Here you can see some examples of common phrases from things that we have abstracted in the past, where our abstracters would place a label of positive or negative next to them. So in the first case the patient has been marked as negative for EGFR, and in the second case positive for ROS1.

So when we train our ML model with thousands of such instances, it starts to learn that a biomarker mention followed by the word “negative” is a very strong signal that this patient is negative for the target biomarker. Similarly, it learns that a biomarker mention followed by the phrase “rearrangement detected” is a pretty strong positive signal. It also learns more complex contextual patterns, because it's able to learn from both the left and the right context around these phrases. So given that we had this really large, rich, diverse dataset of abstracted biomarkers across a variety of diseases, we found that our model actually generalized pretty well, and it was able to predict for any given biomarker across any tumor type.
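Conceptually this looks a lot like the earlier snippet approach: take the words around each biomarker mention and let a classifier learn which left and right contexts signal a positive or negative result. A purely illustrative sketch, with made-up snippets and labels:

```python
# Illustrative sketch: classify a biomarker mention as positive or negative
# from the surrounding words. Snippets, labels, and settings are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Short snippets around a biomarker mention, labeled by abstracters.
snippets = [
    "egfr mutation analysis negative",
    "ros1 rearrangement detected by fish",
    "negative for alk rearrangement",
    "braf v600e mutation identified",
]
labels = [0, 1, 0, 1]  # 1 = positive result, 0 = negative result

# Word n-grams capture context on both the left and the right of the mention.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(snippets)
classifier = LogisticRegression()
classifier.fit(X, labels)
```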

So finally, we ran our entire two-million-patient network through this funnel and took all of the high-probability candidates that our machine learning model identified as potentially positive for this biomarker. We ended up with a cohort of 3,000, and these 3,000 were sent through abstraction, and it turned out that 400 were confirmed to be truly positive for the biomarker. So instead of abstracting the hundreds of thousands of patients we would have had to otherwise, we were able to identify this cohort with a fairly reasonable abstraction burden.

The one thing that's somewhat challenging here is that we lack a reference standard for this cohort. If you'll recall from earlier, we saw that it would take going into about 125,000 patient charts to even create a gold standard of 50 patients. In this case, though, the thing to note is that unlike in our core registries, we know exactly how this dataset is going to be used downstream. And with our partner we've determined that there is some tolerance for bias, given that this kind of real-world dataset would be impossible to create in any other fashion.

That said, we are taking all the steps we can to measure bias. As we saw in the earlier case studies, we'll be doing an extensive comparison of this cohort against benchmark cohorts across a variety of clinical, demographic, and outcome variables. The benchmark cohorts here can be NGS-tested patients as well as our entire broader Flatiron network. And in this case we know that the target cohort will be distributed across a fairly heterogeneous disease population, so we also want to do outcomes analyses where we stratify by prominent disease groupings.

So, ultimately, we recognize that when finding truly needle-in-a-haystack populations, there may be a limit to how much we can measure the bias and how representative this population sample really is. This is counterbalanced, however, by the fact that there really is no other way to find them, and ultimately we are enabling a type of research that wouldn't be possible otherwise. I'm going to pause on that image, because I spent maybe three hours last night looking for it on Google Images, but basically there's a proverbial haystack, as you'll note, but there are also these bright blue skies in the background. And that's meant to capture our excitement about being able to use machine learning to enable these kinds of novel real-world use cases.

So as you saw from the case studies today, we are building on a very strong foundation for machine learning here at Flatiron. On the one hand, we have this really rich dataset of abstracted data that our models can learn from. On the other hand, we've also thoughtfully and heavily invested in creating a continuous monitoring framework and infrastructure that allows us to continually verify that our machine learning models are performing appropriately. And so we believe that by building on this strong foundation, Flatiron is in a really great position to use machine learning to unlock more and more novel use cases. I'm going to end by taking a brief look at some of the areas we hope to address with machine learning at Flatiron in the coming years.

So, in the shorter term we are definitely very focused on what we think of as using machine learning to structure all of our data at scale. And so that means a lot more of the kinds of things we talked about in our previous case studies, and we think that this will greatly help us increase abstraction capacity and reduce end-to-end study time. We are also going to be investing more in our machine learning biomarker capabilities, so we can enable pan-tumor research on rare populations.

Moving on to longer-term, exploratory things, I'm going to caveat this by saying that we're early in our thinking here, but we do think there's a lot of potential for machine learning to predict prognostic factors. So you could think of patient performance status or likelihood of progression as things that a model could be trained to predict. And you could imagine progression predictions being really useful when stratifying groups by risk factors. We also think that having a really good proxy for patient performance status would allow us to move beyond the limitations of ECOG. And as we've all seen in the news, there's been tremendous progress in machine learning in the imaging space, and we would love to be able to take those technologies and incorporate them into our products.

And then finally, it would be great if we could use machine learning for hypothesis generation. We have this growing patient network at Flatiron, and we could potentially be tracking off-label usage of drugs and outcomes. And we have this really disparate set of datasets at Flatiron, so you could imagine that we could identify novel prognostic factors across genomics, EHR data, and radiomics.

Audience Member Question: I have a quick question. In your near- and long-term plans, you mentioned image analysis. Can you talk more about your plans related to imaging? For example, do you have imaging stored with Flatiron? What type of medical imaging are we talking about: MRI, CT, or pathology imaging? Any plan for that? We do have some of this imaging internally in an analytics program, so I'm interested to see if there's a potential collaboration. Thank you.

Josh Haimson: Yeah. Happy to talk about that. So one thing, again caveating: this is all really early. It's more a space that we know machine learning is really good at, and so we want to invest in working with it. Today we don't have images at scale across our network, but it's an active effort to try and work on that. But we're also interested in collaborating in any areas where our partners potentially have images that we could work with.

Michael Kelsh: Hi, Michael Kelsh from Amgen. Just a question. We saw several estimates of the hours saved by the machine learning. I was wondering about the other side of the equation: what's the investment in time? Say we were thinking we'd like to do machine learning for disease X. Any sense of how much time, and I guess how many people, it would take?

Ben Birnbaum: That's a great question. So our team has been around for about a year and a half, and we've obviously spent a lot of effort on these things. I think in general we try to prioritize our efforts towards things that are going to have lasting value. So for both of these use cases, that savings will continue to accumulate over time. And there's also an aspect of building up platforms and capabilities: as you saw, once we had the techniques figured out for that first use case, we were able to apply something pretty similar to the next one. So that's kind of why we're investing in it. But it doesn't necessarily come cheaply; it is definitely an investment. And then also, as we get more data and abstract more data, our models improve and we get more efficient as well.

Eric Klein: Hi. Eric Klein with Lilly. Have you done anything with trying to apply machine learning to prediction of adverse events, safety, or treatment discontinuation?

Geetu Ambwani: So, we actually have an active workstream looking into adverse events. But we are very early on, and this is something where we are still working through the basics of how adverse events really show up in the data, how much of this we can capture, and how well this validates against trials data or other gold standards. We're definitely doing this with an eye towards eventually using machine learning, but I think at this point we are much more interested in validating that this safety data is adequately captured in real-world clinical data.

Diana Merino: Hi. I'm Diana Merino from Friends of Cancer Research. I just had a quick question with regards to patient-reported outcomes. Perhaps you have thought about their integration, to piggyback on the previous question, and how they could potentially be intersected with your data in order to come up with a better understanding of the patient experience in the real world?

Josh Haimson: Yeah. I think PROs are similar to the imaging bucket, where it's something we'd be excited about working with, but to date we don't have it at scale across our network. So we haven't explored many of the tactics there. But again, I think there's a lot of opportunity with PROs and also things like wearables to start picking up signals in that kind of data, and correlating it with the clinical data that we already have. So it's definitely an area of interest for us, but again we don't have the foundation of the PRO data to work with.

Audience Member Question: I think I have a question for Ben. So you mentioned an example where you select patients by metastatic status in breast cancer. Have you ever considered how to define an advanced cohort using machine learning? Because the definition of an advanced cohort is much more complicated than metastatic status. Another question is, you mentioned progression in your roadmap. Are you going to do anything related to response data?

Ben Birnbaum: Great questions. So, the first question was about advanced status and whether we've used machine learning to predict advanced status. We actually have. A good example of that is the non-small cell core registry, where the cohort selection criterion is having advanced disease, which I think is stage IIIB or stage IV.

What we have seen, to your point, is that the models don't perform quite as well. The nice thing is that we can tune our models. Generally, there's a much greater risk in having a high false-negative rate than a high false-positive rate. So if our models aren't performing quite as well, we just tune our decision boundary to make sure that we're still getting high sensitivity, and then, if anything, we suffer a bit on abstraction savings there.

But I guess the answer is that we do see worse performance. It's still something that is pretty achievable through machine learning, at least in our experience, partly because we have a lot of labeled data that we can train on. And then we adjust our thresholds afterwards to account for any differences that we might see in performance.
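As a minimal sketch of that kind of threshold tuning (names and numbers are illustrative): pick the decision boundary on held-out labeled data so that sensitivity stays at the target, and accept whatever abstraction savings remain:

```python
# Hypothetical threshold tuning: choose the lowest decision threshold that keeps
# sensitivity at or above the target on held-out labeled data.
import numpy as np

def tune_threshold(scores: np.ndarray, labels: np.ndarray, target_sensitivity: float = 0.98) -> float:
    """scores: predicted probabilities; labels: 1 = belongs in the cohort."""
    positive_scores = np.sort(scores[labels == 1])
    if len(positive_scores) == 0:
        return 0.0
    # Allow at most (1 - target) of true positives to fall below the threshold.
    k = int(np.floor((1 - target_sensitivity) * len(positive_scores)))
    return float(positive_scores[k])
```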

So the second question was whether we are doing anything for response. Again, I'd echo what Geetu said, which is that this is something that we're working on broadly at Flatiron: validating what real-world response looks like. So that's the first step for us. Once we get there, we'll start to think more about how machine learning fits in.

Audience Member Question: So as you guys are building and developing these algorithms and capabilities, are you also thinking about building structured fields that can then be reintegrated into the EMR systems, so that the drag of pulling from the text is reduced because you're front-ending a lot of the information that's being collected? I'm just wondering what your cycle is and how you guys think about that. Certainly across the settings of care, there are lots of different ways one might do that. I'm just curious about your thoughts.

Josh Haimson: Definitely. So we've explored ways that we could modify the EHR in order to capture some of these higher-priority data points, and we've had success there in limited areas. At the end of the day, we don't think physicians are going to capture every single data point that's necessary for research; at the point of care they're already burdened with their EMR enough. I think there are opportunities to use some of the models to more efficiently structure the data on the front end and just ask for a confirmation from the physician. And it's definitely something we're interested in exploring.