The role of metadata management tools in the evolution of open data formats
The source of truth is shifting as the data platform evolves. It’s a rapidly changing landscape, as companies weigh open table formats and metadata management tools.
The open table format landscape includes Delta Lake, Apache Iceberg and Apache Hudi. Much has happened in recent months, including Databricks Inc. purchasing Tabular Inc., the company founded by Iceberg’s creators, according to Bob Muglia (pictured), entrepreneur and builder.
“In a way, right now, Databricks has controlling capability of both the Iceberg and the Delta formats, but this is important to all the other vendors, and we’ll just watch what happens over the coming months,” Muglia said. “I do think that we’ll continue to see coexistence of these two things. In the last year, fortunately, there have been tools that have been developed to allow for both to be used simultaneously.”
Muglia spoke with theCUBE Research’s George Gilbert at the Supercloud 7: Get Ready for the Next Data Platform event, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed the evolution of data platform standards and the importance of metadata management tools.
Metadata management tools and universal open-source capabilities
Metadata management tools already exist, including Apache XTable, which copies metadata from one table format to another. Fortunately, the table formats all store the actual data in the same on-disk file format, according to Muglia.
“It’s really just the metadata we’re talking about. But I do think we’ll see those things converging, and I expect to see an open-source capability coming out, an open-source environment coming out that will be adopted pretty much universally across the vendors,” he said. “That’s what I hope to see anyway.”
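Because only the metadata differs between formats, a translation tool never has to touch the data files themselves. The hedged sketch below illustrates the idea in Python; the directory layout, file names and paths are hypothetical, and Parquet as the shared file format is an assumption based on common practice rather than a detail from the interview:

```python
# Illustrative layout: a Delta Lake table and an Iceberg table can
# describe the SAME Parquet data files; only the metadata differs.
#
# my_table/
#   part-0001.parquet    <- shared data files
#   part-0002.parquet
#   _delta_log/          <- Delta Lake metadata (JSON transaction log)
#   metadata/            <- Iceberg metadata (snapshots, manifests)

import pyarrow.parquet as pq

# Any Parquet-aware engine can read the data files directly, regardless
# of which table format's metadata currently tracks them.
table = pq.read_table("my_table/part-0001.parquet")
print(table.num_rows, table.schema)

# A tool like Apache XTable, in effect, reads one format's metadata and
# writes the equivalent metadata for another; the Parquet data files are
# never copied or rewritten.
```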
The second thing that appears to be happening is that catalogs are being built on top of the open data lake formats; together, a catalog and an underlying data format become one’s source of truth, according to Muglia. These catalogs are being developed, but they’re not very compatible with each other.
“Once again, my guess is that’s just early stages of things, and we’ll start to see something emerging that could be compatible and used across multiple vendors, but that’s certainly not where we are at the moment,” he said. “We’re early stages of this transition from where we have proprietary formats to an open format, but the industry hasn’t quite settled on it yet.”
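To make the catalog-plus-format pairing concrete, here is a minimal hedged sketch using PyIceberg, one open-source client for Iceberg catalogs; the catalog endpoint, warehouse path and table name are illustrative assumptions, not details from the interview:

```python
# Minimal sketch: a catalog resolves a table name to the current
# table-format metadata, which in turn points at the data files.
# The endpoint, warehouse and table identifier below are hypothetical.

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",           # assumed REST catalog endpoint
        "warehouse": "s3://my-bucket/warehouse",  # assumed storage location
    },
)

# Together, catalog and table format act as the source of truth: the
# catalog says which metadata is current; the metadata says which data
# files make up the table right now.
table = catalog.load_table("analytics.orders")
print(table.schema())
print(table.current_snapshot())
```

The incompatibility Muglia describes shows up exactly here: each vendor’s catalog exposes a different interface for that name-to-metadata lookup, so a client written against one catalog can’t simply be pointed at another.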
It’s clear that the source of truth isn’t just the data. Metadata has to start with the technical, operational layer, because the data warehouses and tools that run in the data environment have to be able to work with the data in a cohesive and secure way, according to Muglia.
“I think over time it’ll include the higher levels of semantics as well. This is one of those open questions. Nobody really knows how that’s going to develop,” he said. “As you go up the stack and try to do more and more, you may want to have more and more capabilities, which could be an opportunity for vendor differentiation as well. So we’ll see.”
The challenge of unifying technical, operational and semantic data
It all poses a question: Is there a way to separate the technical metadata from the operational metadata and from the richer semantics? Or, if one wants a coherent source of truth, do they all need one underlying, unifying owner?
“I don’t think you need one engine for it,” Muglia said. “I think you need to have a way of accessing the data coherently across multiple engines, potentially.”
For instance, a knowledge graph database processor would want to work with the same information a SQL database is working with, according to Muglia. That means some of the same metadata is required.
“But then there’s a lot more information that one could put in the higher level semantic layer. And in fact, if you look at that, there’s a lot of operations that you want to perform on that data,” he said. “They’re graphs, and they’re complicated graphs, and there are relational operators that can be applied across the graphs.”
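To illustrate what relational operators over a graph can look like, here is a toy, hedged sketch in Python: the graph is stored as a plain edge relation, and a self-join, an ordinary relational operator, computes two-hop paths. The entities are invented for illustration, and this is a sketch of the general idea rather than any vendor’s implementation:

```python
# Toy sketch: a knowledge graph stored as a relational edge table.
# Each tuple is one edge: (subject, predicate, object). All hypothetical.
edges = [
    ("order_17", "placed_by",   "alice"),
    ("order_17", "contains",    "sku_42"),
    ("sku_42",   "supplied_by", "acme"),
    ("alice",    "located_in",  "berlin"),
]

def join(rel_a, rel_b):
    """Relational self-join: chain edges where a's object equals b's subject."""
    return [
        (s1, p1, o1, p2, o2)
        for (s1, p1, o1) in rel_a
        for (s2, p2, o2) in rel_b
        if o1 == s2
    ]

# A two-hop graph traversal expressed purely as a relational operator,
# e.g., order_17 -contains-> sku_42 -supplied_by-> acme.
for path in join(edges, edges):
    print(path)
```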
Today’s databases and catalogs don’t do that. But change is happening fast.
“You need something different, which I believe is a relational knowledge graph, which we’re starting to see emerge now,” Muglia said.
Ultimately, companies will need a unified view across all of their underlying metadata to get a consistent source of truth, according to Muglia. Those changes, however, are still some distance away.
“We’re really just beginning to see the emergence of this metadata in this semantic layer as a real thing,” he said.
Stay tuned for the complete video interview, part of SiliconANGLE’s and theCUBE Research’s coverage of the Supercloud 7: Get Ready for the Next Data Platform event.
Photo: SiliconANGLE