publications

* denotes equal contribution and joint lead authorship.

2024-2025

  1. Is Quantization a Deal-breaker? Empirical Insights from Large Code Models.
    S. Afrin, B. Xu, and A. Mastropaolo.

    In arXiv preprint arXiv:2507.09665, 2025.

    The growing scale of large language models (LLMs) not only demands extensive computational resources but also raises environmental concerns due to their increasing carbon footprint. Model quantization emerges as an effective approach that can reduce the resource demands of LLMs by decreasing parameter precision (e.g., from 16-bit to 4-bit) without substantially affecting performance. While recent studies have established quantization as a promising approach for optimizing large code models (LCMs), a specialized subset of LLMs tailored for automated software engineering, their findings offer only limited insights into its practical implications. Specifically, current investigations focus only on the functional correctness of the code generated by quantized models, neglecting how quantization impacts critical aspects of code quality such as reliability, maintainability, and security. To bridge this gap, our study investigates the effects of quantization on the qualitative aspects of automatically generated code. We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Using state-of-the-art static analysis tools, we evaluate software quality metrics and static features, including cyclomatic complexity, cognitive complexity, and lines of code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness but also retains key qualitative code attributes sought after by developers, such as maintainability and structural simplicity.
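The precision reduction at the heart of this study can be sketched in a few lines. This is not the paper's AWQ pipeline (AWQ additionally rescales salient weight channels based on activation statistics); it only illustrates round-trip symmetric quantization of a weight vector to 4-bit integers, with made-up weights:

```python
# Minimal sketch of symmetric weight quantization (illustrative only, not AWQ).
# Each float weight is mapped to a signed 4-bit integer plus a shared scale.

def quantize(weights, bits=4):
    qmax = 2 ** (bits - 1) - 1                   # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax  # one scale for the whole vector
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.55, 0.33, 0.70, -0.21]       # hypothetical weights
q, scale = quantize(weights)                     # small integers plus one float
approx = dequantize(q, scale)
# Reconstruction error is bounded by half the quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

Real quantizers work per-group or per-channel and clamp outliers; the point here is only that storage drops to 4 bits per weight while the dequantized values stay close to the originals.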
  2. Resource-efficient & effective code summarization.
    S. Afrin, J. Call, K.-N. Nguyen, O. Chaparro, and A. Mastropaolo.

    In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (FORGE), 2025.

    Code Language Models (CLMs) have demonstrated high effectiveness in automating software engineering tasks such as bug fixing, code generation, and code documentation. This progress has been driven by the scaling of large models, ranging from millions to trillions of parameters (e.g., GPT-4). However, as models grow in scale, sustainability concerns emerge: they are extremely resource-intensive, highlighting the need for efficient, environmentally conscious solutions. GreenAI techniques, such as QLoRA (Quantized Low-Rank Adaptation), offer a promising path toward sustainable large models, as they enable resource-efficient model fine-tuning. Previous research has shown the effectiveness of QLoRA in code-related tasks, particularly those involving natural language inputs and code as the target output (NL-to-Code), such as code generation. However, no studies have explored its application to tasks that are fundamentally similar but operate in the opposite direction, such as code summarization. This leaves a gap in understanding how well QLoRA generalizes to Code-to-NL tasks, which are equally important for supporting developers in understanding and maintaining code. To address this gap, we investigate the extent to which QLoRA's capabilities in NL-to-Code tasks can be transferred to code summarization, one representative Code-to-NL task. Our study evaluates two state-of-the-art CLMs (CodeLlama and DeepSeek-Coder) across two programming languages, Python and Java, tasking the models with generating descriptions for Python and Java methods. The results align with prior findings on QLoRA for source code generation, showing that QLoRA enables efficient fine-tuning of CLMs for code summarization.
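The adapter mechanism behind QLoRA can be sketched independently of any model. This toy example (all numbers made up; not the paper's training setup) shows the low-rank idea of LoRA: the base weight matrix W stays frozen while only two small factors B and A are trained, and the effective weight is W plus a scaled rank-r update:

```python
# Toy LoRA sketch: for a d_out x d_in base weight W, only B (d_out x r) and
# A (r x d_in) are trainable, so trainable parameters drop from d_out*d_in
# to r*(d_out + d_in) when r is much smaller than the dimensions.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, B, A, alpha, r):
    delta = matmul(B, A)            # rank-r update with the same shape as W
    s = alpha / r                   # LoRA scaling factor
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2x2 base weight
B = [[0.5], [0.0]]                  # trainable factors, rank r = 1
A = [[0.0, 2.0]]
W_eff = lora_effective_weight(W, B, A, alpha=1.0, r=1)
```

QLoRA adds one step on top of this scheme: the frozen base weights are stored in 4-bit precision, which is what makes fine-tuning large models memory-affordable.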
  3. Quantizing large language models for code generation: A differentiated replication.
    A. Giagnorio, A. Mastropaolo, S. Afrin, M. Di Penta, and G. Bavota.

    In arXiv preprint arXiv:2503.07103, 2025.

    Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. LLM effectiveness generally increases with size: the higher the number of trainable parameters, the better the model's ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. A previous work by Wei et al. proposed leveraging quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing the limited impact of quantization on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing compression to the extreme level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
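The reported footprint reduction is consistent with back-of-the-envelope arithmetic. The sketch below (the 34B parameter count comes from the abstract; everything else is a simplifying assumption) computes only raw weight storage, which shrinks by 75% when going from 16 to 4 bits per parameter; the ~70% average the study observes is slightly less because quantized checkpoints also carry per-group scales and other metadata:

```python
# Raw weight storage for an n-parameter model at a given precision.
def weight_bytes(n_params, bits_per_param):
    return n_params * bits_per_param / 8

n = 34e9                                 # 34B parameters, as in the study
fp16_bytes = weight_bytes(n, 16)         # 68 GB of weights at 16-bit
int4_bytes = weight_bytes(n, 4)          # 17 GB of weights at 4-bit
reduction = 1 - int4_bytes / fp16_bytes  # 0.75 for the raw weights alone
```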
  4. A systematic literature review of parameter-efficient fine-tuning for large code models.
    S. Afrin*, M. Z. Haque*, and A. Mastropaolo.

    In arXiv preprint arXiv:2504.21569, 2025.

    The rise of Artificial Intelligence (AI), and particularly Large Language Models (LLMs) for code, has reshaped Software Engineering (SE) by enabling the automation of tasks such as code generation, bug detection, and repair. However, these models require significant computational resources for training and fine-tuning, posing challenges for real-world adoption in resource-constrained environments. To address this, the research community has increasingly turned to Parameter-Efficient Fine-Tuning (PEFT), a class of techniques that enables the adaptation of large models by updating only a small subset of parameters rather than the entire model. In this Systematic Literature Review (SLR), we examine the growing application of PEFT techniques across a wide range of software engineering tasks. We analyze how these methods are used to optimize various deep learning (DL) architectures, focusing on their impact on both performance and efficiency. Our study synthesizes findings from 28 peer-reviewed papers, identifying patterns in configuration strategies and adaptation trade-offs. The outcome of this review is a comprehensive taxonomy that categorizes PEFT usage by task type, distinguishing between generative (e.g., Code Summarization) and non-generative (e.g., Code Clone Detection) scenarios. Our findings aim to inform future research and guide the practical deployment of PEFT in sustainable, AI-powered software development. Our artifacts are publicly available at this https URL
  5. Single-GPU GNN systems: Traps and pitfalls.
    Y. Gong, A. Tarafder, S. Afrin, and P. Kumar.

    In arXiv preprint arXiv:2402.03548, 2024.

    Current graph neural network (GNN) systems have established a clear trend of not reporting training accuracy results and of relying, directly or indirectly, largely on smaller datasets for evaluation. Our in-depth analysis shows that this leads to a chain of pitfalls in the system design and evaluation process, questioning the practicality of many of the proposed system optimizations and affecting conclusions and lessons learned. We analyze many single-GPU systems and show the fundamental impact of these pitfalls. We further develop hypotheses, recommendations, and evaluation methodologies, and provide future directions. Finally, a new reference system is developed to establish a new line of optimizations rooted in solving the system-design pitfalls efficiently and practically. The proposed design can be productively integrated into prior works, thereby truly advancing the state of the art.

2020-2021

  1. Supervised machine learning based liver disease prediction approach with LASSO feature selection.
    S. Afrin, F. J. M. Shamrat, T. I. Nibir, et al.

    In Bulletin of Electrical Engineering and Informatics, vol. 10, no. 6, pp. 3369–3376, 2021.

    In this contemporary era, the use of machine learning techniques is increasing rapidly in the field of medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. Diagnosing the disease at an early stage enables treatment that can cure the patient. In this paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting, and support vector machine (SVM). We also applied the least absolute shrinkage and selection operator (LASSO) feature selection technique to our dataset to identify the attributes most highly correlated with LD. Predictions made by the algorithms under 10-fold cross-validation (CV) are evaluated in terms of accuracy, sensitivity, precision, and F1-score. The decision tree algorithm achieves the best performance, with accuracy, precision, sensitivity, and F1-score of 94.295%, 92%, 99%, and 96%, respectively, with LASSO included. Furthermore, a comparison with recent studies demonstrates the significance of the proposed system.
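LASSO's role as a feature selector comes from the soft-thresholding effect of its L1 penalty, which can be shown in isolation. The coefficients below are made up for illustration; this is not the paper's fitted model:

```python
# Soft-thresholding operator, the core of LASSO's coordinate-wise updates:
# coefficients are shrunk toward zero, and weak ones become exactly zero,
# which is what performs the feature selection.
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

coefs = [0.9, -0.05, 0.4, 0.02, -0.6]    # hypothetical raw coefficients
shrunk = [soft_threshold(c, lam=0.1) for c in coefs]
kept = [i for i, c in enumerate(shrunk) if c != 0.0]  # surviving features
```

Features whose coefficients are zeroed out (indices 1 and 3 here) are dropped before the classifiers are trained; ridge regression's L2 penalty, by contrast, shrinks coefficients but never zeroes them.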
  2. Industrial fault detection using transfer learning models.
    S. Chakraborty, F. J. M. Shamrat, S. Afrin, S. Saha, I. Ahmed, and S. Thapa.

    In 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), IEEE, 2021, pp. 1–6.

    Industry and equipment are critical factors in the advancement of human society in the era of the industrial revolution. Since factories are reliant on their machines, the machines must be maintained daily. However, if the machines are too large to observe directly, an automated process is required to monitor them. By diagnosing signal data with a CNN, faults in the machines can be identified. This paper proposes three transfer learning-based fault diagnosis models using AlexNet, InceptionV3, and GoogLeNet with pretrained weights from the ImageNet dataset. The classification results of the three models are compared. The study shows that the proposed AlexNet architecture achieves markedly higher performance in classifying machine faults on the tested dataset than the other models.
  3. Expert cancer model using supervised algorithms with a LASSO selection approach.
    P. Ghosh, A. Karim, S. T. Atik, S. Afrin, and M. Saifuzzaman.

    In International Journal of Electrical and Computer Engineering, vol. 11, no. 3, pp. 2632–2640, 2021.

    Breast cancer is one of the most critical causes of mortality in the medical field in current times. Each year, a large number of men and women face cancer-related deaths due to the lack of early diagnosis systems and proper treatment. To tackle the issue, various data mining approaches have been analyzed to build an effective model that helps identify the different stages of deadly cancers. The study proposes an early cancer disease model based on five supervised algorithms: logistic regression (henceforth LR), decision tree (henceforth DT), random forest (henceforth RF), support vector machine (henceforth SVM), and K-nearest neighbor (henceforth KNN). After appropriate preprocessing of the dataset, the least absolute shrinkage and selection operator (LASSO) was used for feature selection (FS) with a 10-fold cross-validation (CV) approach. Employing LASSO with 10-fold cross-validation is a novel step introduced in this research. Afterwards, different performance evaluation metrics were measured to assess the predictions of the proposed algorithms. The top accuracy, approximately 99.41%, was achieved by the RF classifier with the integration of LASSO. Finally, a comprehensive comparison was carried out on the Wisconsin breast cancer (diagnostic) dataset (WBCD) against several recent works using all features.
  4. Optimization of prediction method of chronic kidney disease using machine learning algorithm.
    P. Ghosh, F. J. M. Shamrat, S. Shultana, S. Afrin, A. A. Anjum, and A. A. Khan.

    In 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), IEEE, 2020, pp. 1–6.

    Chronic kidney disease (CKD), a slow and late-diagnosed disease, is one of the leading causes of mortality in the medical sector nowadays. A significant number of men and women suffer each year due to the lack of early screening systems and appropriate care. However, patients' lives can be saved by detecting the disease at its earliest stage. Moreover, machine learning algorithms can detect the stage of this deadly disease much more quickly given a reliable dataset. In this paper, the study is implemented using four reliable approaches, namely Support Vector Machine (henceforth SVM), AdaBoost (henceforth AB), Linear Discriminant Analysis (henceforth LDA), and Gradient Boosting (henceforth GB), to obtain highly accurate predictions. These algorithms are applied to an online dataset from the UCI machine learning repository. The highest prediction accuracy, approximately 99.80%, is obtained from the Gradient Boosting (GB) classifier. Different performance evaluation metrics are also reported to show the outcomes. Based on these benchmarks, the most efficient and optimized algorithm for the proposed task can be selected.
