Workshop on Cloud Databases (CloudDB)

Workshop Overview

Cloud providers and database vendors are investing heavily in the development of competitive cloud database offerings, with the goal of providing optimal performance in a cost-effective way for cloud customers. The database community has already contributed significantly towards developing cloud-native OLTP and OLAP databases. Cloud computing pools an abundance of resources and offers them in a pay-as-you-go model. Due to this unique computational environment and business model, there are various research challenges that need to be addressed. These challenges require new research into resource disaggregation, serverless database services, and data movement across multiple cloud providers. Additionally, existing database research topics require re-evaluation, such as:

Multitenancy
How to manage and efficiently make use of cloud resources (CPU, memory, and network/storage I/O) to support multiple tenants with different SLA requirements.

Autonomous Databases
How to automate database tuning and physical design (e.g., data compression, range partitioning scheme, buffer management policy) based on dynamic cloud workloads.

Resource Usage Prediction
How to accurately predict the resource usage of workloads in order to manage the cluster of database resources for different workload types (e.g., TP and AP).

Query Optimizer for Cloud Databases
How to leverage cloud workloads and resources to design better query optimizers (cardinality estimation, cost model, plan enumeration).

Disaggregation
How to leverage different layers of caches (local cache vs. ephemeral storage pool) to accelerate queries.

CloudDBA
How to assist customers in monitoring and optimizing cloud database performance and, when a failure happens, in quickly identifying and fixing it.

Our workshop aims to bring together researchers and practitioners from both academia and industry to discuss these challenges as well as possible directions to tackle them. Specifically, the workshop has three objectives. Firstly, it provides a platform for researchers to present their latest research results in the area of cloud databases. Secondly, it provides an opportunity for practitioners to assess the results and provide feedback. Thirdly, and most importantly, it helps researchers and practitioners to build connections and explore potential collaboration.

Topics of Interest

The suggested topics of interest include, but are not limited to:

Disaggregation
Transaction and Recovery
Query Optimizer
Serverless Database Services
Multitenancy
Autonomous Databases
CloudDBA
Security
HTAP

Organization

Workshop Co-Chairs

Agenda

Time | Topic | Speakers
08:00 - 09:00 | Light Breakfast - 3F Great Hall Foyer & Ballroom Foyer
09:00 - 09:45 | Keynote 1: Taming (some) Heterogeneity in the Cloud | Wolfgang Lehner (TUD)
09:45 - 10:15 | Corra: Correlation-Aware Column Compression | Hanwen Liu (TUM), Mihail Stoian (UTN), Alexander van Renen (UTN), Andreas Kipf (UTN)
10:30 - 11:00 | Coffee Break - 3F Ballroom Foyer
11:00 - 12:30 | Panel: Cloud-based Data Management and AI | Panelists: Wolfgang Lehner, Viktor Leis, Alvin Cheung, Ji Sun
12:30 - 14:00 | Lunch Break - 4F The Open Kitchen & Al fresco
14:00 - 15:00 | Keynote 2: Co-Designing Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware | Viktor Leis (TUM)
15:00 - 15:45 | Keynote 3: What Do We Really Need For Vector Databases in the LLM Era? | Ji Sun (Huawei)
15:45 - 16:00 | Coffee Break - 3F Great Hall Foyer & Ballroom Foyer
16:00 - 17:00 | Keynote 4: Analyzing Data-intensive Cloud Applications for Fun and Profit | Alvin Cheung (UC Berkeley)
17:00 - 17:30 | MetaHive: A Cache-Optimized Metadata Management for Heterogeneous Key-Value Stores | Alireza Heidari (Huawei), Amirhossein Ahmadi (Huawei Cloud), Zefeng Zhi (Huawei), Wei Zhang (Huawei Technologies Canada)

Keynotes

Keynote 1
Taming (some) Heterogeneity in the Cloud

Wolfgang Lehner

Professor (TUD)

Abstract: Cloud infrastructure exhibits diversity across its various layers and components, necessitating careful consideration when designing efficient data systems. This presentation will explore different instances of Cloud heterogeneity, using the Daphne system as a case study. Daphne is a European Union project aimed at creating an open, extensible infrastructure for integrated data analysis pipelines. We'll then delve deeper into TSL, a framework that semi-automatically generates hardware-specific code for different processing units, as a key example of addressing this heterogeneity.

Bio: Wolfgang Lehner is a professor and head of the database technology research group at TUD, Dresden University of Technology, as well as a professor at AAU, Aalborg University. He is interested in cross-cutting aspects of data management systems, from complex analytical algorithms to efficient implementation using standard and not-so-standard devices. He serves the international database community in many roles (e.g., PVLDB Managing Editor) and is a member of the Academy of Europe as well as an active member of the German Science and Humanities Council.

Keynote 2
Co-Designing Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware

Viktor Leis

Professor (TUM)

Abstract: Although the idea of custom, DBMS-optimized OS kernels is old, it is largely unrealized due to the demands of hardware compatibility and the reluctance of users to install specialized operating systems. However, the cloud and the database-as-a-service model make custom OS kernels realistic for the first time. Among specialized OS kernel architectures, unikernels stand out for relying on a single address space, eliminating the need for costly process isolation that is provided by general-purpose operating systems. They offer benefits such as the elimination of system call overhead, direct access to hardware, and reduced complexity. Beyond these immediate advantages, unikernels offer a unique opportunity: the possibility to revisit dated POSIX APIs. By allowing direct interaction with modern hardware primitives, unikernels pave the way for the development of novel abstractions that are not confined to the limitations of older APIs, opening doors to a new era of co-designed, high-performance cloud-native data processing systems and OS kernels.

Bio: Viktor Leis is a professor in the Computer Science Department at TUM, leading the chair for Decentralized Information Systems and Data Management. His research revolves around designing cost-efficient data systems for the cloud and includes core database systems topics such as query processing, query optimization, transaction processing, index structures, and storage.

Keynote 3
What Do We Really Need For Vector Databases in the LLM Era?

Ji Sun

Database Scientist, Huawei

Abstract: Vector databases are becoming basic infrastructure for LLM applications and have attracted over 200 million dollars of investment in the last two years. A vector database performs semantic search based on embeddings of real entities (e.g., documents, paragraphs) and aims to address the hallucination problem of LLM apps. Some of the techniques adopted in vector databases are borrowed from high-dimensional ANN research in the DB and AI fields, such as hashing and navigable graphs. Recent research in academia and industry focuses on index optimization, storage design for large-scale vectors, insert and update approaches, scalar/vector hybrid query optimization, multi-vector query approaches, hardware acceleration, etc. In this keynote, the speaker will introduce these active techniques from the perspective of practical RAG tasks and discuss future directions for vector databases.

Bio: Ji Sun is a database scientist at Huawei, leading the development of the AI-Native Database project for GaussDB. His research interests lie in vector databases, in-database ML, and AI-based database optimization. He received his Ph.D. degree from Tsinghua University.

Keynote 4
Analyzing Data-intensive Cloud Applications for Fun and Profit

Alvin Cheung

Associate Professor, UC Berkeley

Abstract: Detecting relationships in persistent data, such as functional dependencies, has been a standard research topic in the database community. However, nowadays much of such data is generated by applications such as those running in the cloud (e.g., web apps), and this presents a new opportunity for deriving data relationships. In this talk, I will discuss the analysis techniques for data relationships that we have developed for such applications (the fun), and how we have been exploiting such relationships to develop new domain-specific languages for such applications, optimize query execution, and generate application-specific data representations (the profit).

Bio: Alvin Cheung is an associate professor in the EECS department at UC Berkeley, where his group works on data management and programming language research. Work from his group has received a number of best paper awards in different venues, along with early career research awards from the VLDB Endowment and the IEEE Technical Committee on Data Engineering.

Important Dates

The important dates are listed below:

Paper submission: June 15th, 2024 (AoE) [Extended]
Notification of acceptance: July 5th, 2024
Camera-ready submission: July 31st, 2024
Workshop: August 30th, 2024

Submission

https://cmt3.research.microsoft.com/CloudDB2024

Submission Instructions

We accept both long papers (limited to 12 pages + unlimited space for references) and short papers (limited to 6 pages + unlimited space for references).

Submissions are to be formatted following the standard VLDB template available at: https://www.vldb.org/2025/?formatting-guidelines

Submissions will be reviewed in a single-blind manner. Each submission must include all author names and affiliations. We will use CMT’s conflict management system. All the authors of a submission must declare their conflicts before the paper submission deadline. Papers with incorrect or incomplete CoI information as of the submission closing time will be subject to desk rejection.