Projects | DSG Current Projects<br>A Declarative System for Optimizing AI Workloads
A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present Palimpzest, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework -- which explores the search space of AI models, prompting techniques, and related foundation model optimizations -- to implement the query plan with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that Palimpzest uses, and the prototype system itself. We evaluate Palimpzest on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster and 2.9x cheaper than the baseline method, while also offering better data quality. With parallelism enabled, Palimpzest can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These require no additional work by the user.
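The trade-off between runtime, financial cost, and output quality described above can be illustrated with a small sketch. This is not Palimpzest's API or its optimizer: the `Plan` class, the helper functions, the plan names, and every number below are hypothetical, chosen only to show how candidate plans might be pruned to a Pareto frontier and then filtered by a quality target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate plan with estimated metrics (illustrative only)."""
    name: str
    runtime_s: float   # estimated wall-clock time, lower is better
    cost_usd: float    # estimated financial cost, lower is better
    quality: float     # estimated output quality (e.g. F1 in [0, 1]), higher is better


def dominates(a: Plan, b: Plan) -> bool:
    """True if `a` is no worse than `b` on every metric and strictly better on one."""
    no_worse = (a.runtime_s <= b.runtime_s
                and a.cost_usd <= b.cost_usd
                and a.quality >= b.quality)
    strictly_better = (a.runtime_s < b.runtime_s
                       or a.cost_usd < b.cost_usd
                       or a.quality > b.quality)
    return no_worse and strictly_better


def pareto_frontier(plans):
    """Keep only plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


def cheapest_with_quality(plans, min_quality):
    """Pick the cheapest non-dominated plan that meets a quality floor, if any."""
    candidates = [p for p in pareto_frontier(plans) if p.quality >= min_quality]
    return min(candidates, key=lambda p: p.cost_usd) if candidates else None


# Made-up estimates for four made-up plans; none of these come from the paper.
PLANS = [
    Plan("large-model", runtime_s=100.0, cost_usd=10.0, quality=0.90),
    Plan("small-model", runtime_s=20.0, cost_usd=1.0, quality=0.70),
    Plan("small-then-large", runtime_s=40.0, cost_usd=4.0, quality=0.88),
    Plan("large-verbose-prompt", runtime_s=120.0, cost_usd=12.0, quality=0.89),
]

if __name__ == "__main__":
    for plan in pareto_frontier(PLANS):
        print(plan)
    print(cheapest_with_quality(PLANS, min_quality=0.85))
```

In this toy setting, `large-verbose-prompt` is dropped because `large-model` is at least as good on every metric. A real optimizer would also have to estimate these metrics and enumerate the plans, which the sketch takes as given.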
DejaVid<br>We propose a novel framework for Semantic Video Retrieval (SVR), where we aim to find videos within a corpus that are semantically similar to a given query video. Difficulties with this problem include identifying semantically relevant events in a video and matching events in videos despite events spanning different durations. One promising technique is Dynamic Time Warping (DTW), which is temporal deformation-invariant but typically only supports low-dimensional data. In this work, we propose a DTW-augmented neural network architecture that learns the semantic relevance of events and features in a video, enabling general-purpose SVR without hand-coded events or features.
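For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic-programming formulation, not the DejaVid architecture itself. It assumes each video is already represented as a sequence of per-frame feature vectors; the function names and the toy one-dimensional "features" are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic O(n*m) Dynamic Time Warping distance between two sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first i
    elements of seq_a with the first j elements of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments:
            # advance in seq_a only, advance in seq_b only, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy example: `slow` is `fast` with every frame repeated twice.
fast = [(0.0,), (1.0,), (2.0,)]
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,)]

if __name__ == "__main__":
    print(dtw_distance(fast, slow))
    print(dtw_distance(fast, [(0.0,), (1.0,), (3.0,)]))
```

Because the alignment may repeat frames, the stretched sequence `slow` is at distance zero from `fast`, which is the sense in which DTW is invariant to temporal deformation. The quadratic table and the fixed per-frame distance are properties of this textbook version only.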
LucidScript<br>Data preparation is often seen as "janitor work", yet it is essential in data-to-insight pipelines. The increasing availability of data has been followed by an explosion in the diversity of data consumers. However, the technical and domain expertise required prevents many of them from performing extensive data preparation. Further, many seem to be stuck in a vicious cycle of writing one-off programs to process data. Recently, automating data preparation programs has been shown to improve many aspects of the pipeline, including data quality, research reproducibility, and user productivity. We propose a novel approach to automatically improve data preparation programs.
ML for Systems<br>Our vision for research on ML for Systems is laid out in SageDB, a new type of data processing system that highly specializes to a particular application through code synthesis and machine learning. This vision is also a focus of MIT DSAIL.<br>Here, we provide an overview of data systems components that we are currently working on, with more detailed project descriptions in the links, as well as a list of open-source repositories.<br>For high-level descriptions of our research, you can check out our Learned Systems Blog.
ML for Systems Papers<br>If you want to find out more about the exciting work in the area of ML for Systems, we have also compiled a list of ML for Systems Papers. This list is incomplete. If we are missing a paper, please email mlsyspapers@lists.csail.mit.edu and we will include it. If you would like to be informed about new research papers, subscribe here.
Practical DB-OS Co-design with Privileged Kernel Bypass
We revisit the longstanding challenge of coordinating database systems with general-purpose OS interfaces, such as POSIX, which often lack tailored support for data-intensive workloads. Existing approaches to this DB-OS co-design struggle with a limited design space, security risks, and compatibility issues. To overcome these hurdles, we propose a new co-design approach that leverages virtualization to elevate the privilege level of DB processes. Our method enables database systems to fully exploit hardware capabilities via virtualization, while minimizing the need for extensive modifications to the host OS...