LS-PRISM: A layer-selective pruning method via low-rank approximation and sparsification for efficient large language model compression.

Renshuai Tao, Hairong Chen, Yuzhe Guo, Jiakai Wang, Boying Wang, Rongrong Ni, Yao Zhao

Neural Netw

Beijing Jiaotong University, Beijing, 100044, China. Electronic address:

Published: July 2025

Large Language Models (LLMs) have significantly advanced natural language processing (NLP), establishing new benchmarks across a wide range of tasks. However, their large parameter counts make deployment on resource-constrained devices challenging. Current compression techniques often treat all layers uniformly, disregarding the functional differences between layers, even though compressing different layers can affect performance to different degrees. In this paper, we introduce the Layer-Selective Pruning via low-Rank Approximation and Sparsification Method (LS-PRISM), which efficiently compresses LLMs while preserving their performance on key NLP benchmarks such as BoolQ, RTE, and ARC-Challenge. LS-PRISM dynamically applies low-rank approximation to selected matrices within each model layer based on their impact on accuracy and loss, with ranks determined adaptively by a Dynamic Rank Selection method; approximations that improve performance are retained, and the other matrices are left unaltered. We then apply unstructured pruning to the remaining matrices to further sparsify the model, optionally followed by LoRA fine-tuning to recover lost performance. Experimental results demonstrate that LS-PRISM achieves significant reductions in both parameter count and storage with minimal degradation in accuracy. Specifically, for a 2.5B-parameter LLM, we achieve up to a 12% reduction in parameters while maintaining performance comparable to the original model. We also explore the method's applicability to even smaller models and discuss the observed performance differences. LS-PRISM offers a scalable and effective solution for compressing LLMs in resource-constrained environments.
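The two-stage pipeline described in the abstract (selective low-rank approximation, then unstructured pruning of the untouched matrices) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy-threshold rule in the hypothetical `select_rank` stands in for the paper's Dynamic Rank Selection, magnitude pruning stands in for the unspecified unstructured pruning criterion, and `compress_layer`, `magnitude_prune`, and the `evaluate` callback are illustrative names. LoRA fine-tuning is omitted.

```python
import numpy as np
from collections.abc import Callable

Weights = dict[str, np.ndarray]


def select_rank(W: np.ndarray, energy: float = 0.99) -> int:
    """Hypothetical rank rule (not the paper's Dynamic Rank Selection):
    smallest r whose top-r singular values retain `energy` of ||W||_F^2."""
    s = np.linalg.svd(W, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return min(int(np.searchsorted(cumulative, energy)) + 1, len(s))


def low_rank_factors(W: np.ndarray, r: int) -> tuple[np.ndarray, np.ndarray]:
    """Truncated SVD: W (m x n) is approximated by A @ B, A (m x r), B (r x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]


def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured pruning: zero the `sparsity` fraction of entries with smallest |w|."""
    flat = W.ravel().copy()
    k = int(sparsity * flat.size)
    flat[np.argsort(np.abs(flat))[:k]] = 0.0
    return flat.reshape(W.shape)


def compress_layer(weights: Weights, evaluate: Callable[[Weights], float],
                   energy: float = 0.99, sparsity: float = 0.5):
    """Greedy layer-selective pass; `evaluate` returns a score where higher
    is better (e.g. validation accuracy or negative loss)."""
    current = dict(weights)
    best = evaluate(current)
    decisions: dict[str, str] = {}
    unaltered = []
    for name, W in weights.items():
        m, n = W.shape
        r = select_rank(W, energy)
        if r * (m + n) < m * n:  # factors A, B hold fewer parameters than W
            A, B = low_rank_factors(W, r)
            # Score the dense product A @ B; a deployed model would store A and B.
            trial = {**current, name: A @ B}
            score = evaluate(trial)
            if score > best:  # retain only approximations that improve the score
                current, best = trial, score
                decisions[name] = f"low-rank r={r}"
                continue
        unaltered.append(name)
    # Second stage: unstructured pruning of the matrices left unaltered above.
    for name in unaltered:
        current[name] = magnitude_prune(current[name], sparsity)
        decisions[name] = f"pruned {sparsity:.0%}"
    return current, decisions


# Toy usage: "q_proj" is a noisy rank-2 matrix, "v_proj" is full rank.
rng = np.random.default_rng(0)
true = {"q_proj": rng.normal(size=(20, 2)) @ rng.normal(size=(2, 20)),
        "v_proj": rng.normal(size=(20, 20))}
noisy = {"q_proj": true["q_proj"] + 0.01 * rng.normal(size=(20, 20)),
         "v_proj": true["v_proj"].copy()}


def score(ws: Weights) -> float:
    # Stand-in for a benchmark: negative distance to the noise-free weights.
    return -sum(float(np.linalg.norm(ws[k] - true[k])) for k in ws)


compressed, decisions = compress_layer(noisy, score, energy=0.999)
print(decisions)
```

On this toy input the sketch should factorize `q_proj` at rank 2 (the truncation removes most of the noise, so the score improves) and prune `v_proj`, whose rank is too high for factorization to save parameters. That check matters because an m × n matrix stored as rank-r factors needs r(m + n) parameters instead of mn: for a 2048 × 2048 matrix, rank 512 gives 512 × 4096 = 2,097,152 parameters, half of the original 4,194,304. The greedy, one-matrix-at-a-time order and the strict-improvement test are assumptions of this sketch.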

DOI: http://dx.doi.org/10.1016/j.neunet.2025.107909

Similar Publications

Reduced storage direct tensor ring decomposition for convolutional neural networks compression.

Neural Netw

August 2025

Faculty of Electronics, Photonics, and Microsystems, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, Wroclaw, 50-370, Poland.

Mateusz Gabor, Rafał Zdunek

Convolutional neural networks (CNNs) are among the most widely used machine learning models for computer vision tasks, such as image classification. To improve the efficiency of CNNs, many compression approaches have been developed. Low-rank methods approximate the original convolutional kernel with a sequence of smaller convolutional kernels, leading to reduced storage and time complexities.

Low-Rank Approximations for Accurate and Efficient Plane-Wave Second-Order Møller-Plesset Perturbation Theory.

J Chem Theory Comput

August 2025

State Key Laboratory of Precision and Intelligent Chemistry, and Department of Chemical Physics, University of Science and Technology of China, Hefei, Anhui 230026, China.

Zhaolong Luo, Xinming Qin, Wei Hu, Jinlong Yang

The second-order Møller-Plesset perturbation (MP2) theory is a post-Hartree-Fock method widely used to describe weak correlation energies in solids and molecules, but its high computational cost scales as (). Herein, we present an accurate and efficient implementation of MP2 within the plane-wave (PW) basis set for both periodic and molecular systems, which incorporates the interpolative separable density fitting (ISDF) decomposition and the Laplace transformation (LT) of the energy denominator. These innovations avoid the direct construction of electron repulsion integrals (ERIs) and reduce the computational complexity of MP2 from () to ().

Identification of Low Order Systems in a Loewner Framework.

IFAC Pap OnLine

September 2024

ECE Dept., Northeastern University, Boston, MA 02115 USA.

Arya Honarpisheh, Rajiv Singh, Jared Miller, Mario Sznaier

This paper considers the problem of non-parametric identification of low-order models from time-domain experimental data using a combination of Caratheodory Fejer and Loewner-based interpolation, followed by a Loewner matrix Balanced Reduction (LBR) step. As we show in the paper, the Loewner matrix is an estimator for the trace norm of a system, playing a role similar to the one played by the Hankel matrix. However, utilizing Zolotarev numbers to establish decay rate bounds for singular values reveals that the decay of singular values in the Loewner matrix is considerably faster than that in the Hankel matrix.

A dynamic reconstruction and motion estimation framework for cardiorespiratory motion-resolved real-time volumetric MR imaging (DREME-MR).

Phys Med Biol

August 2025

Department of Radiation Oncology, UT Southwestern Medical Center, 2280 Inwood Road, Dallas, Texas, 75235, United States.

Hua-Chieh Shao, Xiaoxue Qian, Guoping Xu, Can Wu, Ricardo Otazo

Based on a 3D pre-treatment magnetic resonance (MR) scan, we developed DREME-MR to jointly reconstruct the reference patient anatomy and solve a data-driven, patient-specific cardiorespiratory motion model. Via a motion encoder simultaneously learned during the reconstruction, DREME-MR further enables real-time volumetric MR imaging and cardiorespiratory motion tracking with minimal intra-treatment k-space data. Approach: DREME-MR integrates dynamic MRI reconstruction and real-time MR imaging into a unified, dual-task learning framework.

Inferring diffusion, reaction, and exchange parameters from imperfect FRAP.

Biophys J

August 2025

LPTMS, CNRS, Université Paris-Saclay, Orsay, France. Electronic address:

Enrico Lorenzetti, Celia Municio-Diaz, Nicolas Minc, Arezki Boudaoud, Antoine Fruleux

Fluorescence recovery after photobleaching (FRAP) is broadly used to investigate the dynamics of molecules in cells and tissues, notably to quantify diffusion coefficients. FRAP is based on the spatiotemporal imaging of fluorescent molecules after an initial bleaching of fluorescence in a region of the sample. Although a large number of methods have been developed to infer kinetic parameters from experiments, it is still a challenge to fully characterize molecular dynamics from noisy experiments in which diffusion is coupled to other molecular processes or in which the initial bleaching profile is not perfectly controlled.
