Dr. Paul Albert: Taming the Data Tide

Making sense of lots of data is something that Biostatistics and Bioinformatics Branch Chief Paul Albert, Ph.D., of the Division of Intramural Population Health Research does every day. Every study can present a new puzzle to solve, and that’s just the way he likes it.

Q: When your neighbors ask you what you do for a living, what do you tell them?

Dr. Albert: I tell them that I’m involved with designing and analyzing a wide range of studies. So, we’re involved at the start of many major studies in terms of what is an optimal design and how best to save resources.

Q: How long have you been at NIH and the NICHD?

I’ve been at NIH nearly 25 years and I’ve been at NICHD―in July it will be 4 years.

Q: What role does a biostatistician play in developing and analyzing a study?

NICHD has just an enormous amount of longitudinal data from studies ranging from reproduction, to child health, to adolescent health, to life course kinds of studies. The interest is really on how things change over time. All our studies have that component to them.

I’ll give you some examples. I’ve been involved in the NICHD fetal growth studies. These are large cohort studies where pregnant women are followed throughout their pregnancies. Ultrasounds are taken and used to come up with a standard of how fetuses grow. So, at repeated time points at different gestational ages, the woman will come into the doctor’s office and get an ultrasound. We have Caucasian, African American, Asian, and Hispanic women.

So we would be involved first in how many subjects you would need in the study: Do you need 600 for each racial group or would you need 20? We basically decided on 600 based on getting precise answers to those particular questions. So we’re in the process of collecting that data.
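To make the contrast between 20 and 600 subjects per group concrete, here is a minimal sketch of how precision grows with sample size. It assumes a simple normal-approximation confidence interval for a mean; the function name and the standard deviation of 1.0 are hypothetical, and this is not the calculation actually used for the fetal growth studies, which would have had to account for repeated ultrasounds and other design features.

```python
import math
from statistics import NormalDist


def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


# Hypothetical standard deviation of 1.0 (in whatever units the measurement uses).
small_study = ci_half_width(sd=1.0, n=20)
large_study = ci_half_width(sd=1.0, n=600)
print(f"n=20:  +/- {small_study:.3f}")
print(f"n=600: +/- {large_study:.3f}")
```

Because the half-width shrinks with the square root of the sample size, going from 20 to 600 subjects narrows the interval by a factor of roughly 5.5 under these assumptions.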

Q: How do you analyze this data?

The exciting thing about NICHD is, those studies pose new analytic challenges that no statisticians have looked at before because we didn’t have this richness of data. So for example, I and members of my group have been looking at new ways—better ways—to predict a poor pregnancy outcome from longitudinally collected data like ultrasounds. Here we have many longitudinal measurements because we have imaging at multiple time points, we have biomarkers of physiology, and we want to combine them all to come up with a predictor of, say, macrosomia (an abnormally large infant) or some poor pregnancy outcome or some type of neonatal morbidity.

We want to predict early—so the ideal thing would be to come up with a strong predictor of some poor outcome as early in gestational age as possible so that there could be an intervention, nutritional or otherwise.

Q: How do you work with scientists who are doing such different types of research?

All of our studies are on the cutting edge. That’s what’s really, really exciting about our Division and, I would say, the Institute as a whole. We could work on and solve interesting statistical problems in every one of the studies because they’re so rich and so interesting.

So ranging from the Prevention Research Branch (Bruce Simons-Morton has studies on teenage driving in which I’m involved) to molecular studies of gestational diabetes that are being done by Cuilin Zhang of the Epidemiology Branch, we’re involved in the study every step of the way.

Q: You mentioned that you find new ways of doing things. What might those be?

Sure, so I'll give you an example of something that I am working on now with a Ph.D. student at the University of Pittsburgh. We had a problem based on a study called NEXT, which is in the Prevention Research Branch, and it’s longitudinal. It’s a cohort study of adolescents, population-based, which means it is all over the country, so we have samples all over the United States.

And what they were interested in looking at was how teenagers sleep, and if you have delayed sleep onset or if you sleep for long periods of time or have certain sleep patterns, does that relate to future poor behavioral outcomes? What is a typical sleep pattern? I have a teenager, so I was particularly interested in this.  

And so we started talking with the investigators and realized that there is no really good measure of teenage sleep. They asked the kids to self-report. Well, those who have adolescents know that teenagers often don’t recall things that well. But we have that information. Plus, they wear a special watch that measures their activity. They wear the watch over a week-long period. They move their hand much less when they’re sleeping than during the day.

So based on those two sources of information, what my summer student and I did, along with our collaborators, was to use what’s called a “hidden Markov model” to reconstruct from those variables the “hidden” true sleep–wake cycle. So the idea is, we don’t really see them sleeping. The only way you could do that is if you had someone watching them, and then of course that would change everything. But what we try to do is to use those two measures, and we even have other measures, to try to uncover the true sleep process.
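As an illustration of the general idea, here is a minimal sketch of a two-state hidden Markov model decoded with the Viterbi algorithm. It is not the model developed for the NEXT study: it uses only a single observed variable (watch activity coded as "low" or "high"), and every probability below is a hypothetical value chosen for illustration.

```python
import math

STATES = ("sleep", "wake")

# Hypothetical parameters, chosen only for illustration.
START = {"sleep": 0.5, "wake": 0.5}
TRANS = {  # probability of moving from one hidden state to the next
    "sleep": {"sleep": 0.9, "wake": 0.1},
    "wake": {"sleep": 0.1, "wake": 0.9},
}
EMIT = {  # probability of each activity reading given the hidden state
    "sleep": {"low": 0.9, "high": 0.1},
    "wake": {"low": 0.3, "high": 0.7},
}


def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state at the first epoch.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best previous state from which to arrive in state s.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


# A single burst of movement in the middle of a quiet stretch.
readings = ["low", "low", "low", "high", "low", "low", "low"]
print(viterbi(readings))
```

With these assumed parameters, the isolated "high" reading is decoded as sleep along with its neighbors, because switching states twice costs more than one unlikely reading. That smoothing is the sense in which the model recovers a hidden cycle rather than just thresholding the watch data.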

The model actually worked really well. So that’s the kind of thing we would publish in an applied biostatistics journal, and then it will be used in subsequent analyses in the NEXT study and subsequent research within our Division, and then hopefully within the community of researchers at large.

Q: What was cutting edge about it? Was it the way that you used the watch or was it the way that you validated it?

Well, what is cutting edge for us is different than it is for the investigators, right? So for Ronald Iannotti, who is an investigator on that study, what was cutting edge for him was coming up with ways to use that watch in a population-based study. For us, it was actually in developing a statistical model that we can use to recover the unobserved sleep-wake cycle.

Q: And this is more than a scientific question. This relates to health and safety?

Sure. Of course, our Division does science, but we also do it in the population setting. This allows us to try to bring scientific innovation to the population, to real people doing real things. Every one of our studies, in some way, addresses an important public health problem. It is a credit to the Division and Institute leadership that this type of research is encouraged and supported. Teenage driving is one of those areas. There are many other examples, ranging from understanding fertility to studying what predicts gestational diabetes and its translation to type II diabetes with a life-course approach, just to name a few.

Q: It seems like the way scientific data is gathered and used and shared has really changed in the past 15 to 20 years. What aspects are you the most interested in?

Yes, there has been an enormous data revolution. A lot of that comes with increased technology, with genomics, which is really high-dimensional now. And it’s not just genomics, but we could be looking at the chemicals and metals, for example, environmental exposure. There are many, many chemicals that people may be exposed to. How do you summarize that information? Do you add up all of the chemical measurements and say that’s the exposure? No, maybe that isn’t the best thing. Do you treat them all as separate? Well, that’s a nightmare in analysis. How do you analyze so much data and so many potential false positives? We need to come up with ways to combine the data in such a way as to make it useful. And that’s part of our goal, our mission.

And one of the exciting things that we as NICHD investigators can do is to move it in a new, creative direction. We have more and more data, so how do we get a signal from that, from all of that information? And it leads us as statisticians to work on a whole new class of problems that people 10 years ago didn’t even think about.

Q: Is that the goal of most of these studies, to predict disease or abnormal outcomes?

That’s a goal of many of the studies. So prediction is, of course, very important. Also, understanding the mechanism is important. There may be certain exposures or genetic traits that are highly associated with disease, yet they do not fully predict that disease. Even in these situations, just understanding the association is important for learning about the biological mechanisms of the disease process.

Q: One of your research interests is missing data. What is missing data?

The first thing I’ll say about missing data is that you should avoid it at all costs. So, if you have a study, you try to make sure you follow up with individuals, especially longitudinal studies. For example, I was involved at one point in an opiate clinical trial, and addicts were told to come back for a urine test. Well, a lot of them didn’t come back at certain points. Why wouldn’t they come back? Maybe because they would test positive; and that's not random “missingness.”

When your clinical trial has 20% missing data like that, what do you do? And so there is a whole field in statistics and in biostatistics on how to analyze data where you have missing data. And we make a distinction between what we call “missing at random,” meaning that people just don’t come and it really doesn’t relate to anything about the study. And then we have another term called “informative missing.” The reason why they are missing is because they would have had a large value if you had seen them. Developing new ways to analyze data with missing values, particularly for longitudinal data, is a relatively new and exciting area in statistical research.

Q: In terms of the opiate study, how did you deal with missing data?

So we basically developed a new methodology to deal with it. We did the analysis many different ways. We did simple kinds of imputation approaches, where we just said we’ll give the worst case, we’ll give the best case, and then we used our more complex models. What we saw is that, in every way that we did the analysis, the new way was more sensible. But we did find that the magnitude of the difference between the treatment groups was very similar [regardless of method of analysis], and so we were pretty sure that the treatment was effective. Actually, we presented it in the paper in different ways. And we said this is reassurance that the effect is real. And that is how we dealt with that.
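The simple imputation checks described above can be sketched as follows. This is a minimal illustration, not the methodology from the paper: the data are made up, the function names are hypothetical, and "worst case" is assumed here to mean that every missed urine test counts as positive in both groups.

```python
def complete_case_rate(outcomes):
    """Proportion of positive tests among observed values only (None = missing)."""
    observed = [y for y in outcomes if y is not None]
    return sum(observed) / len(observed)


def imputed_rate(outcomes, fill):
    """Proportion of positive tests after replacing each missing value with `fill`."""
    filled = [fill if y is None else y for y in outcomes]
    return sum(filled) / len(filled)


def sensitivity_table(treatment, control):
    """Control-minus-treatment difference in positive rates under each assumption."""
    return {
        "complete_case": complete_case_rate(control) - complete_case_rate(treatment),
        "worst_case": imputed_rate(control, 1) - imputed_rate(treatment, 1),
        "best_case": imputed_rate(control, 0) - imputed_rate(treatment, 0),
    }


# Made-up data: 1 = positive test, 0 = negative test, None = did not come back.
treatment = [0, 0, 0, 1, None, 0, 0, None, 0, 1]
control = [1, 1, 0, 1, None, 1, 0, None, 1, 1]

for method, difference in sensitivity_table(treatment, control).items():
    print(f"{method}: {difference:.2f}")
```

If the estimated difference stays similar across these assumptions, as it does in this toy example, that gives some reassurance that the conclusion does not hinge on how the missing values are handled.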

Q: You mentor postdoctoral fellows and students. How has your background or experience with your mentors shaped your approach in mentoring?

I had very good mentors, and I think one of the things that they did for me is let me explore new areas and also to learn that research is a struggle. I try to let students discover by themselves, with help from us. I really try to foster their creativity. I think one of the most important things in mentoring is to bring things out of the student or the fellow and let them learn that the beauty of science is to explore and to come up with new things. It’s not drudgery but real excitement.

Q: You seem excited to talk about your work, and you get some interesting studies to work on.

I tell you, it’s the best job in the world.

Last Updated Date: 07/03/2013
Last Reviewed Date: 07/03/2013