
“It seems the road from Stanford to Harvard leads through Hungary”

Interview with Balázs Pataki, a lead developer of SZTAKI DSD implementing ARP, who showcased the results of the ARP project both to Harvard staff at the Dataverse Community Meeting and to Stanford developers during a visit to California.

In an earlier post we already mentioned that you would be taking part in the Dataverse Community Meeting (DCM) 2025 in Chapel Hill, North Carolina. So how did you end up in California at Stanford? Did your plane land at the wrong airport?

Balázs Pataki: Fortunately all the planes—and even the in-flight meals—were fine. The primary purpose of the trip really was to attend DCM and give a talk, but before that I’d already been collaborating on ARP with the CEDAR developers, who invited me to “drop in” at Stanford and give a presentation on how we integrated their software into ARP. A good friend of mine lives in California and, by chance, also works at Stanford, so I combined visiting her with the CEDAR presentation.

What exactly is the CEDAR software?

Balázs Pataki: CEDAR is a key component of ARP; it underpins the ARP Schema Registry, which we use to create and store the metadata schemas employed in ARP. At Stanford it was originally developed to support medical and biological research: researchers design their metadata schemas in CEDAR and then record their results in line with those schemas. In ARP we re-used CEDAR’s schema-editing and storage capabilities and integrated them with the rest of our system: thanks to this, schemas—“metadata blocks”—can be created for the Dataverse repository, and the same schemas can also be used in our AROMA RO-Crate editor for file-level metadata and for searching in the ARP knowledge base. Over the past few years we had several online meetings with the Stanford team about integrating and extending CEDAR for ARP, but they hadn’t yet seen how we actually used their software in our system. That’s why they invited me to give a talk about our developments.

And how did they like it?

Balázs Pataki: I think they liked it quite a lot. We set up a demo version of ARP that can run locally on a laptop, so we can carry it anywhere and demonstrate ARP’s unique features. What interested them most was how we use the CEDAR-built schemas inside Dataverse, because CEDAR wasn’t originally designed for use in other systems. In our planning we felt that the metadata-block–based schema extension in Dataverse wasn’t easy enough for researchers: to create a schema you have to fill in a huge, complex Excel sheet—hard even for programmers, let alone researchers. CEDAR, on the other hand, offers a simple visual editor, but we had to build the Dataverse integration ourselves.

Let’s fly back to Chapel Hill and the DCM. What was the event about, and why was it worth travelling all the way to North Carolina?

Balázs Pataki: Just as we collaborated with the CEDAR team, we talked a lot with the Dataverse developers while building ARP, mostly to ask questions. Harvard, the main Dataverse developer, holds monthly Dataverse Community Calls, online sessions about ongoing work and user experiences. During one call there was a presentation on how the Open Science Framework (OSF) uses CEDAR. It wasn’t really Dataverse-related, but I commented that we have a deeper CEDAR integration, and I gave a quick demo. The Harvard colleagues liked it so much that a few weeks later they asked for a dedicated online meeting where we showed in detail how CEDAR can be used to create and edit metadata blocks and to handle file-level metadata with RO-Crate. They said this was such an important extension that it should be shown to the whole Dataverse community at DCM 2025 and that we should explore how it could become a core Dataverse feature. As for the conference itself: DCM 2025 hosted talks on exactly this kind of development, as well as on operational experiences and future directions for Dataverse. We saw a very strong European Dataverse core (Norway, the Netherlands, Belgium, Germany, France) whom we’d like to work with more closely because, like ARP, they aim to integrate into EOSC. It was also interesting to hear that most data repositories face the same challenges as ARP, whether it’s securing funding for operations or educating users.

You mentioned future Dataverse development. What can you tell us? Is everything revolving around AI now?

Balázs Pataki: Not everything, but naturally AI came up a lot. Everyone is looking for ways to automate the repetitive, sometimes tedious parts of data deposit: automatic metadata generation, AI-supported data processing, and so on. Three AI-related developments were presented. The first is a “Chat with your data” feature, a free-text interface for querying datasets and their contents. The second is an AI chatbot that helps users search the Dataverse documentation. The third, and the most important, is an implementation of an MCP (Model Context Protocol) server that opens Dataverse installations to tool-using LLMs, enabling complex agent-based workflows with deposited datasets. Beyond AI, another major focus is redesigning the Dataverse user interface. A new React-based GUI has been implemented and will debut in version 6.7. This required significant server-side changes, but from now on anything doable in the UI will also be possible via API calls, making it easier to build custom front-ends and services. That’s especially interesting for us because AROMA is also a React app and will integrate more smoothly with the new Dataverse UI than with the current setup.

Dataverse is open-source software. How many people work on the project?

Balázs Pataki: I just checked: there are over 200 contributors, including four developers from SZTAKI. Harvard leads the project, but they’re increasingly open to external contributions. Networking Dataverse installations, that is, linking separate instances, is gaining importance. Two related developments were shown: Dataverse Hub, which collects statistics from 140 installations worldwide and presents them in a nice dashboard, and the Dataverse Marketplace, which will make it much easier to extend an installation with new features or modules directly through the UI rather than via command line or API. The Marketplace also lets external developers distribute their modules and gain recognition. Our goal is to make our CEDAR and RO-Crate work available there, spreading the metadata methodology we developed in ARP, which, judging by the conference, is among the global front-runners.

It’s still a long way off, but what are your plans for DCM 2026?

Balázs Pataki: First, we’ll have less travelling to do because it’ll be held in Barcelona. Second, we’ll spend this year feeding our developments (currently available only in our own fork) back into the official Dataverse and CEDAR releases and publishing them on the Dataverse Marketplace. That will require cooperation from both Stanford and Harvard, since changes are needed in both pieces of software for them to work together. Interestingly, those two groups haven’t had much contact so far; we’re the ones bringing them together. It really seems that the road from Stanford to Harvard leads through Hungary, through ARP and SZTAKI.
