LiLIS: Enhancing Big Spatial Data Processing with Lightweight Distributed Learned Index

Abstract

The efficient management of big spatial data is crucial for location-based services, particularly in smart cities. However, existing systems that incorporate distributed spatial indexing, such as Simba and Sedona, still incur substantial index construction overheads, which makes them far from optimal for real-time analytics. Recent studies demonstrate that learned indices can achieve high efficiency through a well-designed machine learning model, but how to design a learned index for distributed spatial analytics remains unaddressed. In this paper, we present LiLIS, a Lightweight Distributed Learned Index for big Spatial data. LiLIS combines machine-learned search strategies with spatial-aware partitioning within a distributed framework, and efficiently implements common spatial queries, including point queries, range queries, k-nearest neighbors (kNN) queries, and spatial joins. Extensive experimental results over real-world and synthetic datasets show that LiLIS outperforms state-of-the-art big spatial data analytics systems by 2–3 orders of magnitude for most spatial queries, and that index building achieves a 1.5–2× speed-up. The code is available at https://github.com/SWUFE-DB-Group/learned-index-spark.

Key Contributions

Pioneering Innovation

"The first work to introduce spatial learned index to big data analytics"

Lightweight Architecture

"The lightweight learned index enables efficient distributed spatial queries"

Performance Breakthrough

"LiLIS achieves 2–3 orders of magnitude speed-up vs state-of-the-art"

Efficient Indexing

"1.5–2.0× faster index building"

Distributed Learned Index

LiLIS follows the two-phase filtering approach: partitions that cannot contain results are pruned first, and the local index of each remaining partition is then searched. The local index is a learned model built within a given partition. By adopting error-bounded spline interpolation, which learns the two-dimensional distribution of the underlying spatial data, LiLIS achieves efficient index construction with very few parameters while ensuring prediction accuracy.
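The idea can be illustrated with a minimal Python sketch. It is not the LiLIS implementation: the names `build_spline`, `LearnedIndex`, `two_phase_range_query`, and the parameter `eps` are hypothetical. As a simplification it fits the spline to x-coordinates only, sorted within each partition, and refines on y afterwards, rather than learning the full two-dimensional distribution. It also assumes x-values are distinct within a partition.

```python
import math
from bisect import bisect_left, bisect_right


def build_spline(keys, eps):
    """Greedily pick knots (key, position) over strictly increasing keys so that
    linear interpolation between neighbouring knots is off by at most eps."""
    knots, start, i = [(keys[0], 0)], 0, 1
    while i < len(keys):
        x0, x1 = keys[start], keys[i]
        slope = (i - start) / (x1 - x0)
        fits = all(abs(start + (keys[j] - x0) * slope - j) <= eps
                   for j in range(start + 1, i))
        if fits:
            i += 1                       # segment start..i stays within the bound
        else:
            start = i - 1                # close the segment at the last good point
            knots.append((keys[start], start))
    if len(keys) > 1:
        knots.append((keys[-1], len(keys) - 1))
    return knots


class LearnedIndex:
    """Local index of one partition: points sorted by x plus a spline over x.
    Assumption: x-values inside a partition are distinct."""

    def __init__(self, points, eps=4):
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]
        self.eps = eps
        self.knots = build_spline(self.xs, eps)
        self.knot_keys = [k for k, _ in self.knots]

    def _window(self, key):
        """Return [lo, hi) positions guaranteed to contain the bound for key."""
        ks = self.knot_keys
        if key <= ks[0]:
            pred = 0.0
        elif key >= ks[-1]:
            pred = float(len(self.xs) - 1)
        else:
            s = bisect_right(ks, key) - 1
            (x0, p0), (x1, p1) = self.knots[s], self.knots[s + 1]
            pred = p0 + (key - x0) * ((p1 - p0) / (x1 - x0))
        lo = max(0, math.floor(pred - self.eps))
        hi = min(len(self.xs), math.ceil(pred + self.eps) + 1)
        return lo, hi

    def range_query(self, xmin, ymin, xmax, ymax):
        # The model predicts where to look; binary search runs only in that window.
        lo = bisect_left(self.xs, xmin, *self._window(xmin))
        hi = bisect_right(self.xs, xmax, *self._window(xmax))
        return [p for p in self.points[lo:hi] if ymin <= p[1] <= ymax]


def mbr(points):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def two_phase_range_query(partitions, box):
    """partitions: list of (mbr, LearnedIndex) pairs.
    Phase 1 prunes partitions by bounding box; phase 2 probes local learned indices."""
    out = []
    for bounds, index in partitions:
        if intersects(bounds, box):
            out.extend(index.range_query(*box))
    return out
```

To try it, split a point list into chunks, build `(mbr(chunk), LearnedIndex(chunk))` for each, and pass the list with a query box `(xmin, ymin, xmax, ymax)` to `two_phase_range_query`. The spline stores only its knots, so a smooth key distribution needs far fewer parameters than points. A larger `eps` yields fewer knots but a wider search window per lookup.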

Results

Query Performance Comparison

LiLIS performance while varying datasets

LiLIS Performance Across Different Datasets

Skewed and uniform range queries under different selectivities

kNN queries in LiLIS when varying k

Index Building Efficiency

Takeaways

1. LiLIS is the fastest system evaluated, outperforming all competitors by 2–3 orders of magnitude for most spatial queries.

2. LiLIS is sensitive to the choice of partitioner, and tree-based partitioners generally perform better than grid-based partitioners.

3. LiLIS is affected by data characteristics as well as dataset size; nevertheless, it consistently outperforms its competitors across all datasets.

4. Range queries in LiLIS are sensitive to skewness, whereas uniform range queries run stably under different selectivities; kNN queries in LiLIS are insensitive to k over commonly used values.

5. Building indices in LiLIS is faster than in competitors, though the speed-up is less significant than its query performance advantage.

BibTeX

@misc{chen2025lilis,
  title={LiLIS: Enhancing Big Spatial Data Processing with Lightweight Distributed Learned Index},
  author={Zhongpu Chen and Wanjun Hao and Ziang Zeng and Long Shi and Yi Wen and Zhi-Jie Wang and Yu Zhao},
  year={2025},
  eprint={2504.18883},
  url={https://arxiv.org/abs/2504.18883}
}