A Comparison of End-to-End Decision Forest Inference Pipelines

Hong Guan, Saif Masood, Mahidhar Dwarampudi, Venkatesh Gunda, Hong Min, Lei Yu, Soham Nag, Jia Zou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks have been developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that, for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML inference in a single User Defined Function (UDF), there also exists a relation-centric representation that breaks decision forest inference down into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them with the aforementioned decoupled inference pipelines and with existing in-database inference pipelines such as SparkSQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
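
To make the distinction between the two in-database representations concrete, here is a minimal sketch, not the authors' implementation. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in database, and the toy forest, the table names (`samples`, `nodes`, `features`), and the function `predict_udf` are all hypothetical. The UDF-centric path evaluates the whole forest inside one opaque function, while the relation-centric path stores the tree nodes as rows and performs the traversal with joins and an aggregate.

```python
import sqlite3

# Hypothetical toy forest: two depth-1 trees; the prediction is the sum of leaf values.
# Row layout: (tree_id, node_id, feature, threshold, left_id, right_id, leaf_value)
NODES = [
    (0, 0, 0, 0.5, 1, 2, None),
    (0, 1, None, None, None, None, 1.0),
    (0, 2, None, None, None, None, 3.0),
    (1, 0, 1, 2.0, 1, 2, None),
    (1, 1, None, None, None, None, 10.0),
    (1, 2, None, None, None, None, 20.0),
]
SAMPLES = [(0, 0.2, 5.0), (1, 0.9, 1.0)]  # (sample_id, f0, f1)


def predict_udf(f0, f1):
    """UDF-centric: the database sees one opaque function call per row."""
    feats = (f0, f1)
    total = 0.0
    for tree_id in sorted({n[0] for n in NODES}):
        nodes = {n[1]: n for n in NODES if n[0] == tree_id}
        node = nodes[0]
        while node[6] is None:  # descend until a leaf is reached
            go_left = feats[node[2]] <= node[3]
            node = nodes[node[4] if go_left else node[5]]
        total += node[6]
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample_id, f0, f1)")
conn.execute(
    "CREATE TABLE nodes "
    "(tree_id, node_id, feature, threshold, left_id, right_id, leaf_value)"
)
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", SAMPLES)
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)", NODES)
# Long-format view of the features so that a join can select one by index.
conn.execute(
    "CREATE VIEW features AS "
    "SELECT sample_id, 0 AS feature, f0 AS val FROM samples "
    "UNION ALL SELECT sample_id, 1, f1 FROM samples"
)

# UDF-centric query: a single scalar function wraps the whole model.
conn.create_function("predict_udf", 2, predict_udf)
udf_rows = conn.execute(
    "SELECT sample_id, predict_udf(f0, f1) FROM samples ORDER BY sample_id"
).fetchall()

# Relation-centric query: each traversal step is a join; leaves are summed.
rel_rows = conn.execute(
    """
    WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
        SELECT s.sample_id, n.tree_id, n.node_id
        FROM samples s JOIN nodes n ON n.node_id = 0
      UNION ALL
        SELECT w.sample_id, w.tree_id,
               CASE WHEN f.val <= n.threshold THEN n.left_id ELSE n.right_id END
        FROM walk w
        JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
        JOIN features f ON f.sample_id = w.sample_id AND f.feature = n.feature
    )
    SELECT w.sample_id, SUM(n.leaf_value)
    FROM walk w
    JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
    WHERE n.leaf_value IS NOT NULL
    GROUP BY w.sample_id
    ORDER BY w.sample_id
    """
).fetchall()

print(udf_rows)
print(rel_rows)
```

Both queries return the same per-sample predictions. In the relation-centric form the model is ordinary table data, so the query engine can plan and partition the joins itself. This sketch only illustrates the shape of the two representations and says nothing about their relative performance.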

Original language: English (US)
Title of host publication: SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 200-215
Number of pages: 16
ISBN (Electronic): 9798400703874
State: Published - Oct 30 2023
Event: 14th ACM Symposium on Cloud Computing, SoCC 2023 - Santa Cruz, United States
Duration: Oct 30 2023 → Nov 1 2023

Publication series

Name: SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing

Conference

Conference: 14th ACM Symposium on Cloud Computing, SoCC 2023
Country/Territory: United States
City: Santa Cruz
Period: 10/30/23 → 11/1/23

Keywords

  • Decision Forest
  • Machine Learning System

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
  • Information Systems
