Curl: Private LLMs through Wavelet-Encoded Look-Up Tables
Abstract. Recent advancements in transformers have revolutionized machine learning, forming the core of large language models (LLMs). However, integrating these systems into everyday applications raises privacy concerns, as client queries are exposed to model owners. Secure multiparty computation (MPC) allows parties to evaluate machine learning applications while keeping sensitive user inputs and proprietary models private. Due to inherent MPC costs, recent works introduce model-specific optimizations that hinder widespread adoption by machine learning researchers. CrypTen (NeurIPS’21) aimed to solve this problem by exposing MPC primitives via common machine learning abstractions such as tensors and modular neural networks. Unfortunately, CrypTen and many other MPC frameworks rely on polynomial approximations of the non-linear functions, resulting in high errors and communication complexity. This paper introduces Curl, an easy-to-use MPC framework that evaluates non-linear functions as look-up tables, resulting in better approximations and significant round and communication reductions. Curl exposes a programming model similar to CrypTen’s and is highly parallelizable through tensors. At its core, Curl relies on discrete wavelet transformations to reduce the look-up table size without sacrificing accuracy, which results in up to 19× round and communication reductions compared to CrypTen for non-linear functions such as logarithms and reciprocals. We evaluate Curl on a diverse set of LLMs, including BERT, GPT-2, and GPT Neo, and compare against state-of-the-art related works such as Iron (NeurIPS’22) and Bolt (S&P’24), achieving at least 1.9× lower communication and latency. Finally, we resolve a long-standing debate regarding the security of widely used probabilistic truncation protocols by proving their security in the stand-alone model. This is of independent interest, as many related works rely on this truncation style.

Keywords: large language models · privacy-enhancing technologies · secure multiparty computation

1 Introduction

Large language models (LLMs) like GPT-2, GPT-4 [1], BERT [19], and LLaMA [67] have emerged as prime examples showcasing the capabilities of artificial intelligence. LLMs assist individuals and businesses in everyday tasks, from machine translation [5] to text generation [32] and question answering [57], among others. To generate human-like responses, LLMs have been trained on vast amounts of data and continue learning through their interactions with users. However, as LLMs become increasingly integrated into human lives, privacy becomes a critical concern, as individuals frequently share sensitive information, including names, addresses, and credit card numbers, or even financial and healthcare information [61]. On top of that, there is a shift towards personalized AI, with OpenAI adding a memory feature to ChatGPT [29]. For an LLM-powered assistant to understand individual preferences, habits, and workflows and to provide tailored assistance, it would require access to an immense amount of personal data. As this personalized data is retained and used to continuously improve the LLMs, the possibility of a data breach or unauthorized access poses huge risks to individuals’ privacy.

Privacy-enhancing technologies (PETs) such as multiparty computation (MPC) [28,76] enable a plethora of privacy-preserving machine learning (PPML) use-cases.
In the most prominent setting, a server possesses a proprietary model and aims to provide it as a service to clients for use with their private data [13,46,47,38,33]. The objective is for clients to receive only the inference results without gaining any knowledge about the model, while the model provider learns nothing about the clients’ inputs. Another popular PPML use case enables multiple distrusting parties to securely train a model over their joint sensitive data without revealing anything about their data to each other [53,66,45,74,40].

Several MPC systems leverage linear secret sharing schemes, as they allow linear operations to be performed with minimal communication overhead [17,24]. However, non-linear operations (e.g., square roots, logarithms) often present greater challenges and require specialized techniques such as polynomial approximations [46], look-up table evaluations [70], or other protocol-specific approaches [8,16]. Furthermore, incorporating domain knowledge can significantly enhance the efficiency and effectiveness of these protocols by tailoring the parameters to better fit specific use cases [36,48,58].

1.1 Related Work

1.1.1 Focusing on Function Secret Sharing

Recently, several secure computation works have emerged leveraging function secret sharing (FSS) [3,70,65,56,55]. Pika [70] extends the prior work of [3] by showcasing a novel approach to securely evaluate look-up tables (LUTs) through FSS. While it demonstrates efficacy in benchmarks on popular datasets like MNIST and CIFAR-10, its scalability poses challenges. As noted by Grotto [65], for large LUTs the computational cost may render the protocol infeasible due to the extensive number of distributed point function (DPF) evaluations required. In contrast, Curl circumvents this challenge by eliminating the need for DPF evaluations, thanks to the integration of the discrete wavelet transform (DWT) technique. On top of that, our technique can also benefit FSS-based frameworks by reducing the size of the LUTs, resulting in less computation and communication.

On the other hand, Grotto [65] introduces innovative protocols leveraging custom splines and DPFs to handle a subset of functions efficiently. Yet, it faces challenges in terms of computational and communication overhead compared to alternatives like Sigma [33]. Orca [40] showcases the potential of GPU acceleration in FSS protocols, particularly tailored for convolutional neural networks (CNNs). However, its suitability for other architectures like transformers is questioned due to their reliance on heavy non-linearities [40]. Sigma [33] builds on top of Orca and distinguishes itself by relying on minimal-sized LUTs through protocols tailored for small ranges. Despite the efficiency of these custom protocols with small LUTs, it demands significant computing and communication resources (e.g., 1 TB of RAM and a roughly 9 Gbps communication link). Furthermore, its use of deterministic truncation, although claimed to improve security, falls short in speed compared to probabilistic truncation methods. Unfortunately, all FSS-based works focus on the two-party setting, as FSS becomes impractical with more than two parties.

1.1.2 Focusing on Preprocessing

The line of work initiated by Pika [70] focuses on the two-party setting with an additional dealer party. In this scenario, the preprocessing phase requires the dealer to prepare and distribute an O(n)-sized element.
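To make this dealer-assisted model concrete, the sketch below (a minimal Python illustration with toy parameters of our choosing, not Pika’s actual DPF-based protocol) performs a one-time-truth-table style lookup: the dealer secret-shares a randomly rotated copy of the table, which is O(n)-sized material, and the online phase opens only a masked index. It also illustrates why linear operations on additive shares are local.

    import secrets

    MOD = 2**16  # ring Z_{2^16}; toy parameter
    N = 256      # LUT size (8-bit inputs); toy parameter

    def share(v, mod=MOD):
        # Additive secret sharing: v = s0 + s1 (mod mod)
        s0 = secrets.randbelow(mod)
        return s0, (v - s0) % mod

    # --- Dealer (preprocessing): O(N)-sized material ---
    table = [(i * i) % MOD for i in range(N)]          # example LUT: f(x) = x^2
    r = secrets.randbelow(N)                           # secret random rotation
    rotated = [table[(i + r) % N] for i in range(N)]   # T'[i] = T[(i + r) mod N]
    pairs = [share(v) for v in rotated]
    t0 = [p[0] for p in pairs]                         # party 0's table share
    t1 = [p[1] for p in pairs]                         # party 1's table share
    r0, r1 = share(r)                                  # shares of the rotation

    # --- Online phase: a single opening of a masked index ---
    x0, x1 = share(42)                                 # private input x = 42
    # Subtracting shares is local (linear); only opening h costs communication.
    h = ((x0 - r0) + (x1 - r1)) % MOD % N              # h = x - r (mod N)
    # Each party indexes its table share locally: (t0[h], t1[h]) shares T[x].
    y = (t0[h] + t1[h]) % MOD
    assert y == table[42]                              # T'[x - r] = T[x]

The opening of x - r reveals nothing about x, since r is uniformly random; security, however, rests entirely on the dealer.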
Given the assumption that the dealer will not collude with any party, reducing dependency on such assumptions is desirable. This has led to the development of protocols that eliminate the need for a dealer. Notably, some previous works have aimed to achieve this in the two-party setting, enabling both parties to independently generate the required preprocessing material [39,18,6]. The One-Time Truth Table (OTTT) protocol [39] employs a boolean circuit to represent the table for every possible input, resulting in an exponential complexity relative to the bit size. The OP-LUT protocol [18] attempts to enhance the OTTT preprocessing phase but only achieves improvements for small bit sizes, with the communication cost remaining exponential. The SP-LUT protocol [18] significantly enhances the preprocessing phase but modifies the online phase, leading to a communication cost that is exponential in the bit size during the online phase. FLUTE [6] offers advancements over previous works but still requires O(2^n) communication in the setup phase. Therefore, finding a preprocessing phase with sub-exponential communication complexity while maintaining the efficiency of the online phase remains an open challenge.

1.1.3 Focusing on Fully-Homomorphic Encryption

Fully homomorphic encryption (FHE) schemes have benefited significantly from the use of LUTs to enhance performance and enable new applications. This concept was initially proposed by Ducas and Micciancio [22], who devised a method to evaluate arbitrary binary gates in FHE using LUTs. Building on this foundation, Chillotti et al. [10] implemented the evaluation of arbitrary functions through a tree of leveled multiplexers, leading to the development of the Torus FHE (TFHE) scheme. Despite their development of a fast bootstrapping mechanism, capable of executing in approximately 10 milliseconds on a CPU, its adoption has been limited. This is primarily because the control inputs for the multiplexers had to be fresh ciphertexts (meaning no prior computation on them was possible) and the approach necessitated expressing programs as deterministic automata. A further improvement over TFHE was presented in [11] with the introduction of the programmable bootstrapping (PBS) technique, allowing for efficient and general-purpose LUT evaluation. To enhance adoptability, HELM [30] expanded on the PBS technique and introduced a framework for automatically converting Verilog hardware description language (HDL) into encrypted circuits. HELM employs three modes of operation: the first mode exclusively processes binary gates; the second mode operates on integers using secure LUT evaluations; and the third, mixed mode, works with binary circuits and "bridges" over to integers for secure LUT evaluations before returning to the binary domain. However, HELM is limited to very low-precision LUTs, because converting an integer to multiple bits requires multiple n-to-1 LUTs. Recently, Ripple [31] proposed using compression techniques based on discrete wavelet transform (DWT) theory to decrease the size of LUTs, thereby accelerating the PBS technique for general smooth functions.
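To build intuition for this DWT-based compression, the sketch below (a toy Python illustration under assumed sizes and a fixed-point encoding of our choosing, not Ripple’s or Curl’s exact construction) applies j levels of a Haar-style transform to a 2^n-entry LUT and keeps only the approximation coefficients, yielding a 2^(n-j)-entry LUT of block averages that is indexed with the top n - j bits of the input.

    import math

    n, j = 10, 4   # 2^10-entry LUT, compressed 2^4-fold; toy sizes
    SCALE = 2**5   # toy fixed-point scale
    f = math.log   # example smooth non-linearity

    # Full-size LUT over fixed-point inputs (i + 1) / SCALE
    full = [f((i + 1) / SCALE) for i in range(2**n)]

    def haar_level(t):
        # One Haar DWT level, normalized by 2 (rather than sqrt(2)) so the
        # approximation coefficients are plain pairwise averages.
        approx = [(t[2*i] + t[2*i + 1]) / 2 for i in range(len(t) // 2)]
        detail = [(t[2*i] - t[2*i + 1]) / 2 for i in range(len(t) // 2)]
        return approx, detail

    # Keep only the approximation coefficients for j levels: 2^(n-j) entries.
    small = full
    for _ in range(j):
        small, _ = haar_level(small)   # detail coefficients are dropped

    def lookup(x):
        # Approximate f((x + 1) / SCALE) by indexing the compressed LUT with
        # the top n - j bits of x, i.e., averaging over the dropped low bits.
        return small[x >> j]

    print(lookup(700), f(701 / SCALE))  # ~3.08 vs ~3.09 for this smooth f

For smooth functions the dropped detail coefficients are small, so the compressed table trades a bounded approximation error for a 2^j-fold reduction in LUT size; that size reduction is what drives the round and communication savings described above.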
1.1.4 Focusing on Secure LLM Inference