
Publication Type
Thesis
Authorship
Nafi, K. W.
Title
Exploring Cross-Language Software Similarity Analysis Using Source Code Context
Year
2026
Publication Outlet
Department of Computer Science, University of Saskatchewan
DOI
Citation
Nafi, K. W. (2026) Exploring Cross-Language Software Similarity Analysis Using Source Code Context, Department of Computer Science, University of Saskatchewan
https://hdl.handle.net/10388/18057
Abstract
The rapid growth of multi-language and cross-platform software development has created an urgent need for effective techniques to identify functional similarity across programming languages. Developers routinely reuse or reimplement functionally similar code blocks across cross-language and multilingual software systems, producing both intentional and unintentional cross-language similar code fragments, and they adapt APIs and libraries that serve similar purposes but are implemented in different languages. While these adaptations can improve portability and extend the reach of software to more users, they also increase development cost, maintenance complexity, and the potential for inconsistency or defects. Despite recent advances in machine learning, code representation learning, and Large Language Models (LLMs), existing approaches for cross-language software similarity often struggle with deep syntactic reasoning, diverse coding styles, and the limited availability of high-quality multilingual code datasets.
This thesis is grounded in the premise that accurate detection of cross-language code similarity can significantly mitigate longstanding challenges in cross-language software development and maintenance. Motivated by this premise, the thesis investigates the foundational problem of establishing reliable, robust cross-language code-similarity measures. The investigation aims to support a wide range of software engineering tasks, including single-language, cross-language, and multi-language development and maintenance activities. Drawing on a comprehensive systematic literature review, the thesis identifies key limitations in the state of the art and proposes five complementary contributions across four levels of code granularity. First, it introduces a universal software similarity detector (CroLSim) that categorizes cross-language software applications by leveraging the similarity of API call documentation. Second, it presents a source code feature-driven and API documentation-adapted cross-language clone detection model (CLCDSA) that combines syntactic features with the semantic similarity of API documentation to identify cross-language clones more accurately. Third, it develops an LLM-guided, multimodal framework (XLCoCo) that fuses multi-intent source code information retrieval from LLMs and attention-based variational autoencoders (VAEs) to predict structural feature similarity, improving the performance of cross-language code-to-code search and clone detection tasks. Fourth, it proposes XLibRec, a technique for recommending analogical cross-language libraries by mining reliable library usage information from different developer community discussion forums, along with short library descriptions collected from various package managers. Finally, it introduces XAPIRec, an efficient method for analogical API mapping based on automatically collected, functionally equivalent API usage patterns and on LLM-driven API documentation similarity. This removes the need to manually mine functionally similar parallel code fragments and requires neither prior knowledge of true API mappings nor a labeled API mapping dataset.
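Several of the contributions above rely on comparing API documentation written for different languages. As a rough illustration of that general idea only, the following sketch scores two documentation strings with bag-of-words cosine similarity. This is a swapped-in textbook technique, not the method used by CroLSim, CLCDSA, or XAPIRec; the function names and the documentation snippets are hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase a documentation string and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, paraphrased documentation snippets (assumptions for this
# sketch; not quoted from any real API reference).
java_doc = "Reads all lines from a file and returns them as a list of strings"
python_doc = "Read all lines from the file and return them as a list of strings"
unrelated_doc = "Opens a network socket and connects it to a remote host"

similar = cosine_similarity(java_doc, python_doc)
different = cosine_similarity(java_doc, unrelated_doc)
```

Under these assumptions, the two file-reading descriptions score higher than the unrelated pair. A purely lexical measure like this misses paraphrases and inflected forms such as "reads" versus "read", which is one reason semantic or LLM-based document similarity, as named in the abstract, may be preferred in practice.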
Together, these contributions form a scalable ecosystem that advances automation, accuracy, and practical applicability in industry-level cross-language software development and maintenance. The techniques are extensively evaluated against state-of-the-art baselines across diverse datasets and programming languages, demonstrating consistent improvements in precision, recall, ranking quality, and real-world usability. Overall, this thesis offers a unified framework to support developers and organizations in building, understanding, and maintaining robust cross-language software systems at scale.