To highlight researchers who have derived impact or benefit from sound data management practices, the Research Data Initiative (RDI) within the Duke Office for Research & Innovation has launched the Data Management Exemplar Series. If you or someone you know is a data management exemplar, please connect with us so we can continue to foster a research community fluent in strong data management practices.
Continuing our interview series, our interviewee today is Patrick Charbonneau, Ph.D., an associate professor of chemistry and physics in the Trinity College of Arts & Sciences. He studies soft matter using theory and computer simulations. Professor Charbonneau has notably earned a National Science Foundation CAREER Award, a Sloan Fellowship, and an Oak Ridge National Lab Ralph E. Powe award.
Can you tell us a little bit about your research area?
I’m a computational and theoretical researcher with a focus on soft condensed matter physics.
There are several different types of systems that fall within that category. The most prominent topic in my group is the glass problem, but I also work on systems that form interesting mesoscale structures, such as gels and periodic microphases, as well as on the crystallization of proteins.
How would you describe the makeup of your research team?
There are a handful of students and postdocs who are part of my group. They do some one-on-one work with me, and sometimes parts of the group collaborate together on a project. We also work extensively with external collaborators, mostly in France and Italy.
In your scientific career, when did you first hear about research data management as a concept?
There is not a sudden event that made me realize the benefits of data management. It came about gradually.
During my Ph.D. studies, data management was simply not practiced. But as a postdoc, I collaborated with a Ph.D. student who, when he finished his thesis, burned two DVDs with all the data for the work we had done together. It was well structured and included things like associated metadata. I thought this was really helpful, and it was comforting to have these disks in hand.
When I started at Duke, I used this example to guide how my own graduate students would go about managing data. Around the time of their defense, students would clean and organize all of their data and store them on a shared departmental drive.
But as time went on, I started to realize that this was too little. When external researchers would ask me for datasets, if I wasn’t still in contact with the student who worked on that project, then I would need to go through the student’s files to find the relevant material. In some cases, this could take me half a day. It became apparent that this was a problem as it was taking a lot of my effort to respond to perfectly reasonable requests for data.
I was thinking about what could be done to improve the situation when I saw a paper from a colleague stating: ‘if you want to get the data presented in this paper, consult this URL.’ That link was managed by the university libraries at their institution. I saw the advantage of a system like this: neither my students, my postdocs, nor I would need to be present to connect the data to someone requesting it. Also, as someone who had dealt with the frustration of requesting datasets from other researchers and not getting a response, I thought this could be a robust system.
Following that realization, I approached the Duke University Libraries Research Data Repository and asked: how can we put together a process such that, when I publish a paper, I can deposit the associated dataset at the same time, with the two linked so that whoever wants access to the data can get it?
In spring 2016, I switched over to systematically depositing my datasets in the repository.
Have you encountered any barriers or difficulties in getting new postdocs or new students to use your system of data management?
Students largely learn the research process from scratch anyway. They typically don’t know how to do things like write (good) code, so they end up learning how to do those things while also learning proper data management. As a result, the process doesn't seem strange to them at all.
For postdocs, it’s more of a shock, but I think over time everyone sees the value of doing data management. The benefits quickly become self-evident.
Have you had instances when data management assisted you with re-orienting yourself to old project data?
Not quite, but I have a counterexample. Right now, I’m writing a review that includes a section about work we did just before we moved to depositing data to repositories.
I need to re-assemble an old figure to make it prettier and include more information. However, that requires me to go back through old student data, which had been saved but not deposited and therefore requires some effort to locate. Because the metadata is likely incomplete, I will also need to figure out how the data is structured, how to plot it, etc. Because there are other parts of the review to write, I’ve been putting off this task, but eventually I’ll have to dig into those old files. When I do that, I plan to also deposit this particular dataset, to avoid future pain if someone ever requests it. In any event, at some point, I know I’ll have to dedicate an afternoon to this thankless task, which could probably be done in 30 minutes if the data had already been deposited.
Regarding your open system of sharing data through repositories, have you seen any benefits to data reuse or citations?
I’ve not monitored reuse and, in a sense, it was never my objective. Impact as a beneficial outcome of data management is often mentioned by people like you: “By sharing through a repository you may get more citations or data reuse.” Frankly, the number of reuses is not as important to me as knowing that the raw data, the codes and the scripts are out there for others to access and review. What’s important is that this material is already prepared, so that in 10 years if someone wants access they are able to get it without me needing to reach out to a former student or doing the work myself.
Have most researchers in your discipline moved toward sounder data management practices, or is it still something that has not caught on?
Using data management creates a precedent, and raises expectations of what people could and should do. A few years ago, I worked on a project with two other PIs. When we wrote papers together, I insisted on depositing the data into a repository and things went smoothly. Recently, when speaking to one of the two PIs, they mentioned how much they liked my data management practices and had since followed suit. However, they also mentioned that the third PI had not kept up with the model, because they deemed their data to be really valuable and important. They felt that if a student invested in doing a piece of research, he or she needs to be able to write a couple of papers using that data before sharing it. While we both understood those sentiments, at some point keeping data private goes beyond data protection and becomes data hoarding. Data should be published as soon as it stops being precious, if not earlier.
Personally, I don't feel the need to keep much data hidden, because I don't think there is much immediate competition on most of the things I do. Even if there were, I also see benefits in someone else using my data before I do. In the longer term, that might allow me to build a broader research effort that is more robust.
Once you try open sharing, you may embrace it fully or slightly less so, but the benefits accumulate regardless.
What advice would you give a fellow researcher who wants to get started using data management?
My advice is that depositing data is only really hard once. Maybe the second time is hard-ish but it doesn't stay hard. Eventually, it's just coloring the process of doing research. Like many things, we learn it, we do it, and then it becomes easy. You have to figure out the timeline, the workflow and how to structure your data. But you get feedback from professional curators and you learn from your mistakes. Afterwards, it only requires a small effort to do it right every time. There is no reason it should slow down research in any way once it becomes part of how you do things.
In practice, the impacts of managing data quickly trickle down to other areas. One thing you will notice is that your entire research process becomes geared towards the deposition process. Once you know you're going to write a paper, you have some idea of the figures, and therefore you know how you're going to deposit the data. You know how to structure data, how to write scripts, etc. Everything becomes aligned towards the creation of a shareable deliverable.
Do you have any closing thoughts or comments around data management?
I’ve been arguing for a long time that we should favor the ethos of data management, rather than pushing a compliance-based approach. I genuinely believe that data management is good for you, good for your students, good for the group, and good for the research community. There are obviously situations that are harder to deal with, such as identifiable data, so I know that it's not always easy. But the key point is how much better you feel about your research when data management becomes part of the process.