In this interview with James Baurley, data scientist and co-founder of BioRealm, we discuss how biological markers, especially genetic biological markers, may be used in clinical trials, along with a few of the computational and statistical issues and limitations involved.
Conor Ryan: What are biological markers?
James Baurley: There are many definitions of biological markers, also known as biomarkers, and not all of them agree. In some ways researchers are still trying to decide on a single definition. I think of a biomarker as simply a variable that is measured on or in an individual. We usually say biomarkers are either diagnostic or predictive: a diagnostic biomarker can be used to detect whether you have a disease, as in early cancer detection, and a predictive biomarker can predict how a drug will work for you.
CR: Can you give us an example?
JB: We’re currently working on a very interesting project that’s a great example.
Today, if you want to quit smoking, there are many methods available, but, depending on the individual, some of these methods will work better than others, and some can’t be used due to the side effects.
What we really need, and what we hope to develop, is a way for you to go see your doctor (or go to a store, buy a test kit, and send a DNA sample to a lab) and get a report that tells you the most effective way for you to quit, with the fewest side effects.
One genetic biomarker that might help us to develop that test is the way your liver very quickly breaks down the nicotine that enters your bloodstream into two other compounds. The ratio of these two compounds is called the nicotine metabolite ratio. It varies with each individual, and other clinical trials have shown that it can help determine which smokers are going to be most successful at quitting.
We want to take that genetic biomarker, combine it with other biomarkers, perhaps some clinical factors, and develop the test we need. It’s a huge challenge, so we're using many different approaches and tools.
CR: It sounds like genetic biomarkers could be extremely useful in solving many medical issues. How do you find new genetic biomarkers?
JB: The two most popular methods are the “candidate gene” approach and the “genome-wide” approach.
The candidate gene approach is used when we strongly suspect that a given gene, or given variations in that gene, might make someone susceptible to a given disease, or help prevent them from getting it.
The genome-wide approach, also known as a hypothesis-free approach, is used when we genotype as many variations across the genome as we can, and then test for associations one at a time.
CR: What are the limitations of these approaches?
JB: We have to perform many tests, and there can be many false positives. We’re also testing one variation at a time, so if the disease is caused by many factors, you'll likely miss some important ones. That means the approaches are expensive, in terms of both money and time, and you might not get results that are predictive (showing who might be most susceptible to a disease) or diagnostic (showing who might already have a disease).
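To put rough numbers on the false-positive problem, here is a minimal sketch, assuming one million independent tests at a significance level of 0.05 (both figures are illustrative, not from the interview). It uses the Bonferroni correction, one common way to control false positives, and simulates p-values for variants that have no real effect. The function names are made up for this example.

```python
import random

def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of null tests that pass an uncorrected threshold."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate near alpha."""
    return alpha / n_tests

# Illustrative scale only: one million variants, each tested once.
n_tests = 1_000_000
print("expected false positives:", expected_false_positives(n_tests))
print("Bonferroni threshold:", bonferroni_threshold(n_tests))

# Simulate 100,000 variants with no true effect: their p-values are uniform on [0, 1].
rng = random.Random(42)
null_p_values = [rng.random() for _ in range(100_000)]
naive_hits = sum(p < 0.05 for p in null_p_values)
strict_hits = sum(p < bonferroni_threshold(len(null_p_values)) for p in null_p_values)
print("uncorrected hits:", naive_hits, "corrected hits:", strict_hits)
```

The stricter threshold removes almost all of the spurious hits, but it also makes weak real effects harder to detect, which is part of why one-at-a-time testing can miss important factors.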
CR: Are there any new statistical tools that help?
JB: What's most exciting and interesting is that we're finally able to use many statistical tools that were discovered a long time ago, but simply required too much computational power. Today we can use those tools, and even improve them to make them more powerful, enabling us to essentially reverse-engineer biological pathways based on the data.
CR: You mentioned computational power. What are some of the other computational issues?
JB: There's a lot going on. In a genome-wide case there could be many millions of variables, and it's often not just one of them having a huge effect, but many of them working together in combination, in many complicated ways.
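A quick count shows why combinations are the hard part. This sketch assumes one million variables and a hypothetical rate of one million tests per second; both numbers are chosen only for illustration.

```python
import math

def n_combinations(n_variables, order):
    """Number of distinct sets of `order` variables drawn from n_variables."""
    return math.comb(n_variables, order)

def years_to_test(n_tests, tests_per_second=1_000_000):
    """Time to run n_tests at an assumed (hypothetical) testing rate."""
    seconds_per_year = 60 * 60 * 24 * 365
    return n_tests / tests_per_second / seconds_per_year

n_variables = 1_000_000
for order in (1, 2, 3):
    count = n_combinations(n_variables, order)
    print(f"order {order}: {count:,} combinations, "
          f"about {years_to_test(count):,.2f} years at the assumed rate")
```

Even three-way combinations are out of reach for exhaustive testing at this scale, and real interactions may involve more variables than that.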
In that kind of problem, there are so many variables and combinations that we can't test every one, not even with all of the computational power available today, so we get around that limitation by using approximation. One set of tools we use is Markov chain Monte Carlo methods, applied to the data we've already collected, but they can take a lot of time to find the combinations that are really interesting. We also use external databases to speed things up. For example, those external databases can tell us that there might be an entire area that we don't need to look at any further, saving us a lot of time.
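The general idea of using Markov chain Monte Carlo to search over combinations can be sketched as follows. This is a toy illustration under stated assumptions, not BioRealm's method: the data are simulated (variables 0 and 3 drive the outcome), the scoring function is a made-up penalized fit of a simple sum of the selected variables, and all function names are hypothetical. The sampler is a basic Metropolis scheme that flips one variable in or out of the current combination at each step.

```python
import math
import random
from collections import Counter

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def score(subset, X, y, penalty=2.0):
    """Toy log-score: fit of the summed variables minus a size penalty."""
    if not subset:
        return 0.0
    burden = [sum(row[j] for j in subset) for row in X]
    r = pearson(burden, y)
    return -0.5 * len(y) * math.log(max(1.0 - r * r, 1e-12)) - penalty * len(subset)

def mcmc_search(X, y, n_iter=2000, seed=0):
    """Metropolis search over subsets; returns how often each subset was visited."""
    rng = random.Random(seed)
    n_variables = len(X[0])
    current = frozenset()
    current_score = score(current, X, y)
    visits = Counter()
    for _ in range(n_iter):
        j = rng.randrange(n_variables)
        proposal = current ^ {j}  # flip variable j in or out (symmetric proposal)
        proposal_score = score(proposal, X, y)
        diff = proposal_score - current_score
        if diff >= 0 or rng.random() < math.exp(diff):
            current, current_score = proposal, proposal_score
        visits[tuple(sorted(current))] += 1
    return visits

def simulate(n=200, n_variables=12, seed=1):
    """Simulated data: binary variables, outcome driven by variables 0 and 3."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_variables)] for _ in range(n)]
    y = [row[0] + row[3] + rng.gauss(0, 0.1) for row in X]
    return X, y

X, y = simulate()
visits = mcmc_search(X, y)
best, count = visits.most_common(1)[0]
print("most visited combination:", best, "visits:", count)
```

The chain evaluates only 2,000 of the 4,096 possible subsets of 12 variables here, and it spends most of its time on high-scoring combinations. With millions of variables the same principle applies, but the chain needs far more steps, which is where prior information from external databases can help narrow the search.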
But before we can get to any of that, we have to collect all of the data, arrange it, and clean it. Every part (collection, arranging, and cleaning) is an art and a science in itself. Each step must be done professionally, by experts who have a lot of experience in their given areas, so that we avoid mistakes that would make our results meaningless and save as much time and money as possible.
All of this, from data collection to data analysis and finally determining which biomarkers are really useful, is a complex process, and there isn't a single software package you can just run that spits out an answer. We're very lucky at BioRealm, because we've built a team that has literally hundreds of years of combined experience in areas as diverse as genetics, database and medical research web application development, statistics, clinical trials, bioinformatics, and, of course, regulatory concerns.
CR: Once you believe that you have a good biomarker, what do you do next?
JB: We have to validate everything. We take our results and attempt to replicate them using sets of samples from different human populations, and look for diagnostic and predictive properties all over again.