Open lakehouse spurs innovation amid AI data demands
Generative artificial intelligence is demanding breakneck innovation from enterprises, highlighting a critical need for cohesive data management and driving a seismic shift in how data is stored, processed and used. It’s also prompting a rethink of the open lakehouse concept pioneered by companies such as Onehouse.
“We firmly believe an open lakehouse is the way of the future, but there is no Snowflake experience for the lakehouse per se,” said Vinoth Chandar (pictured), chief executive officer of Infinilake Inc. (aka Onehouse). “Onehouse was founded on the premise that we are going to bet on this open lakehouse being the one house that is going to store the data for diverse use cases. And the bet [is] that we are going to be in a world where there [are] multiple workloads, and we feel like today it’s kind of converged to that point.”
Chandar spoke with theCUBE Research’s John Furrier at the Supercloud 7: Get Ready for the Next Data Platform event, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed how, by prioritizing open formats, unified data layers and collaborative development, Onehouse is paving the way for a future where data management is more integrated, efficient and adaptable to the evolving needs of AI-driven enterprises.
The genesis of Onehouse and the open lakehouse model
Onehouse was founded to address the growing complexities of data management and to champion the open lakehouse model. The company’s origins trace back to Uber Technologies Inc., where the team built the world’s first data lakehouse, initially termed a transactional data lake. That pioneering project evolved into Apache Hudi, the technology that underpins the open lakehouse approach, according to Chandar.
“We had only two options: We run every pipeline in a streaming mode, which costs a lot of money; it’s not even feasible to do that. Or we make our data processing on the lake smarter and more intelligent,” Chandar explained. “We looked at warehouses and databases, and we said if we brought some of that functionality just on top of HDFS and like a YARN compute layer … what we missed was this database abstraction on top of it. So that’s how we conceived the project.”
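That design, which became Apache Hudi, is easiest to see in code. The sketch below is a minimal illustration of the pattern Chandar describes, upserting changed records into a lake table and then reading only the new commits rather than rescanning everything; it assumes a Spark session with the Hudi bundle on the classpath, and the paths, table name and field names are all hypothetical.

```python
# A minimal sketch of "smarter processing on the lake": upsert changes into
# a Hudi table, then consume only the commits after a checkpoint instead of
# re-running a full batch scan or a costly always-on streaming pipeline.
# Assumes Spark with the Apache Hudi bundle; all names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-incremental-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates = spark.read.json("s3://example-bucket/raw/trips/")  # hypothetical input

# Upsert: rows with an existing key are updated in place; new keys are inserted.
(updates.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://example-bucket/lake/trips"))

# Incremental pull: read only commits after a known instant, so downstream
# jobs process deltas rather than the whole table.
changed = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240701000000")
    .load("s3://example-bucket/lake/trips"))
changed.show()
```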
The ethos at Onehouse is that an open lakehouse offers a scalable and unified solution for diverse data workloads. Unlike traditional data lakes, which often result in ambiguous returns on investment, the lakehouse model promises a more cohesive and efficient data management framework. This evolution reflects the industry’s shift toward integrating data lakes and warehouses, resulting in a versatile system capable of handling various data formats and workloads, according to Chandar.
The need for unified data and the open data layer model
Data unification is foundational to the lakehouse architecture. Gen AI use cases also benefit greatly from a unified data layer rather than fragmented, siloed data sources. Unified data does not imply centralization but rather an integrated system where data from various sources can be accessed and utilized seamlessly, Chandar noted.
“If you look at the lakehouse, the story so far, it’s actually been about structured data,” he said. “So what we’ve done is adapt our warehouse capabilities, which have been more focused on structured data, to the lake, but I think [in] the coming years, you will see that the lakehouse technologies [are] focused a lot more on unstructured data in a way that you can store both side by side. You have a single data management framework covering all of this.”
The concept of an open data layer is central to this approach. By adopting open data formats and ensuring interoperability across different data engines, organizations can achieve greater flexibility and scalability. This model aligns with the broader industry trend toward open-source solutions and collaborative development, which are crucial for fostering innovation and adaptability, according to Chandar.
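A small example makes the interoperability point concrete. Because the bytes on disk are plain Parquet, two unrelated engines can query the same files with no export or copy step in between; this sketch assumes pyarrow and DuckDB are installed, and the local path is purely illustrative.

```python
# Two independent engines reading the same open-format files, no copies made.
# Assumes pyarrow and duckdb are installed; the path is illustrative.
import duckdb
import pyarrow.parquet as pq

# Engine 1: pyarrow loads the Parquet files into an in-memory Arrow table.
trips = pq.read_table("lake/trips")
print(trips.num_rows)

# Engine 2: DuckDB runs SQL directly over the very same files on disk.
print(duckdb.sql("SELECT count(*) FROM read_parquet('lake/trips/*.parquet')").fetchone())
```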
“What’s broadly going on now is these four layers are getting unbundled in a way that … now we are saying, ‘Instead of a proprietary columnar format, store data [in] Parquet, [and] use one of these table formats to represent them as tables. And the SQL layer of these warehouses, [which] are proprietary engines, and governance layers can recognize these tables,’” he said. “I think that’s where we moved from 2021 to 2024. All [of] the recent news you see around the governance catalogs is essentially the next layer in the stack that is now getting a little bit unpacked.”
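One way to read that stack is to label each layer in a single session. The sketch below is one hypothetical rendering of the unbundling Chandar describes, using Spark SQL with Hudi: Parquet as the storage format, Hudi as the open table format, Spark as one interchangeable SQL engine, and the session catalog standing in for the governance layer. The configuration, database and table names are assumptions for illustration, not a prescribed setup.

```python
# Labeling the unbundled layers in one Spark session. Assumes Spark with the
# Hudi bundle and SQL extension enabled; database and table names are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("unbundled-stack-sketch")
         .config("spark.sql.extensions",
                 "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS demo")

# Layers 1 and 2: data lands as Parquet files, organized by an open table
# format (Hudi here) rather than a proprietary columnar format.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.trips (
        trip_id STRING, fare DOUBLE, updated_at TIMESTAMP
    ) USING hudi
    TBLPROPERTIES (primaryKey = 'trip_id', preCombineField = 'updated_at')
""")

# Layer 3: the SQL engine is now swappable; any Hudi-aware engine could run this.
spark.sql("SELECT count(*) FROM demo.trips").show()

# Layer 4: the catalog/governance layer recognizes the table as shared metadata.
spark.sql("DESCRIBE TABLE EXTENDED demo.trips").show(truncate=False)
```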
Stay tuned for the complete video interview, part of SiliconANGLE’s and theCUBE Research’s coverage of the Supercloud 7: Get Ready for the Next Data Platform event.
Photo: SiliconANGLE