GPIoT: Tailoring Small Language Models for IoT Program Synthesis and Development

1The Hong Kong Polytechnic University, 2University of Cambridge
Figure: The architecture of GPIoT.

Abstract

Code Large Language Models (LLMs) enhance software development efficiency by automatically generating code and documentation based on user requirements. However, code LLMs cannot synthesize specialized programs when tasked with IoT applications that require domain knowledge. While Retrieval-Augmented Generation (RAG) offers a promising solution by fetching relevant domain knowledge, it necessitates powerful cloud LLMs (e.g., GPT-4) to process user requirements and retrieved contents, which raises significant privacy concerns. This approach also suffers from unstable networks and prohibitive LLM query costs. Moreover, it is challenging to ensure the correctness and relevance of the fetched contents. To address these issues, we propose GPIoT, a code generation system for IoT applications by fine-tuning locally deployable Small Language Models (SLMs) on IoT-specialized datasets. SLMs have smaller model sizes, allowing efficient local deployment and execution to mitigate privacy concerns and network uncertainty. Furthermore, by fine-tuning SLMs with our IoT-specialized datasets, the SLMs' ability to synthesize IoT-related programs can be substantially improved. To evaluate GPIoT's capability in synthesizing programs for IoT applications, we develop a benchmark, IoTBench. Extensive experiments and user trials demonstrate the effectiveness of GPIoT in generating IoT-specialized code, outperforming state-of-the-art code LLMs with an average task accuracy increment of 64.7% and significant improvements in user satisfaction.

Introduction

Large language models (LLMs) are revolutionizing various aspects of embedded system development and mobile computing, e.g., smartphone task automation, advanced virtual assistants (Nie et al., 2024), and even IoT data comprehension. Code LLMs (e.g., WizardCoder and CodeLlama) stand out as promising tools designed to synthesize programs based on user requirements described in natural language. As illustrated in Fig. 1, the integration of programming tools with code LLMs significantly enhances software development by automating code completion, code generation, bug detection, documentation writing, etc.

Figure 1: Existing code LLMs need to transmit sensitive data to remote servers. In contrast, GPIoT features three local SLMs to protect user privacy and reduce query costs.

While powerful and promising, existing code LLMs tend to provide only general solutions with sub-optimal performance when confronted with IoT applications that require specialized domain knowledge. This is because they target general-purpose programming tasks rather than any particular domain. Moreover, IoT-related knowledge and programs occupy only a small proportion of the datasets on which code LLMs were trained. Consequently, IoT terminology is assigned a lower priority during inference, and the generated code is less dedicated to the IoT domain. This motivates the following research question:

Can we build a code LLM specially tailored for IoT application code generation tasks?

A potential approach can be Retrieval-Augmented Generation (RAG), which provides LLMs with retrieved domain knowledge to enhance their abilities in generating accurate and contextually relevant solutions. Existing works construct a sophisticated LLM+RAG agent to gradually generate code through multiple intermediate steps via prompts. Nonetheless, they suffer from three main problems.

  • A powerful LLM with strong language comprehension capability is needed to learn from the retrieved knowledge. However, cloud LLMs (e.g., GPT-4) suffer from unstable network conditions, high query costs, and privacy concerns, while local LLMs (e.g., Llama2-70b) impose harsh system resource requirements (e.g., memory and computation).
  • Complicated RAG designs (e.g., iterative retrieval) are required to ensure the correctness and high relevance of the retrieved knowledge, at the cost of extended processing time. Otherwise, LLMs may fail to focus on the IoT context and still provide general solutions.
  • Meticulously designed prompts are required to ensure that outputs must strictly follow pre-defined formats (Lin et al., 2024), which is extremely challenging due to the hallucinations and unreliability of LLMs.

To tackle the above problems, we propose GPIoT, a code generation system tailored for IoT application development by fine-tuning local small language models (SLMs) on IoT-specialized text-generation datasets. This approach has the following benefits:

  • The system overhead, privacy leakage, and network instability can be mitigated, as SLMs have smaller sizes and can be locally deployed without incurring heavy resource burdens.
  • SLMs tuned on IoT-specialized datasets can generate responses with significantly enhanced quality and higher relevance to the IoT domain.
  • As our tuning datasets are well-structured text data, the tuned SLMs can produce intermediate outputs that follow the expected format with enhanced stability and fewer hallucinations.

Motivation

Existing code LLMs aim to synthesize programs and enhance software development efficiency and accuracy. While they perform well on general and simple programming tasks (e.g., sorting algorithms), they often struggle with complex problems in the IoT domain. For example, when prompted to design an R-peak detection method for electrocardiogram (ECG) data, existing code LLMs can only use the find_peaks() function, which adopts a general peak detection algorithm rather than a dedicated one tailored for ECG data (e.g., Pan-Tompkins). The underlying reason is that IoT knowledge and programs only occupy a small proportion of the training dataset of code LLMs. As a result, despite being presented with abundant IoT terminologies in the prompt, LLMs still tend to prioritize and respond with more general words, due to their higher similarity (shorter distance in Fig. 2(a)) within the vector representation space.
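
To make the contrast concrete: a Pan-Tompkins-style detector first transforms the raw ECG (band-pass, derivative, squaring, moving-window integration) so that peak picking operates on a QRS-emphasized envelope rather than the raw waveform. The following is a minimal sketch of that pipeline, not the paper's generated code; the filter orders, cutoffs, and thresholding choices are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def pan_tompkins_rpeaks(ecg, fs=360):
    """Simplified Pan-Tompkins-style R-peak detector (illustrative sketch)."""
    # 1) Band-pass filter (~5-15 Hz) to emphasize the QRS complex and
    #    suppress baseline wander and high-frequency noise.
    b, a = butter(2, [5 / (fs / 2), 15 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ecg)
    # 2) Differentiate to highlight the steep slopes of the QRS complex.
    diff = np.diff(filtered)
    # 3) Square to rectify and amplify large slope values.
    squared = diff ** 2
    # 4) Moving-window integration (~150 ms) to merge each QRS into one lobe.
    win = int(0.15 * fs)
    mwi = np.convolve(squared, np.ones(win) / win, mode="same")
    # 5) Threshold on the integrated envelope with a ~200 ms refractory
    #    period; note peak picking happens on mwi, not the raw ECG.
    peaks, _ = find_peaks(mwi, height=0.5 * mwi.max(), distance=int(0.2 * fs))
    return peaks
```

The key difference from naively calling find_peaks() on the raw signal is that steps 1-4 build a QRS-specific envelope, which is what makes the detector robust to T-waves and baseline drift.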

Figure 2: (a) Existing LLMs tend to prioritize general terms; (b) LLM+RAG systems require multiple cascaded agents.

LLM+RAG methods address this by retrieving domain knowledge for reference and establishing multiple cascaded agents to facilitate information transfer among modules. For example, in Fig. 2(b), multiple LLM-based agents are employed for different tasks during development (i.e., domain knowledge retrieval, task planning, coding, and debugging). Sophisticated prompt design and meticulously structured intermediate outputs are necessary to ensure that one agent's output can be accurately parsed and interpreted by another. We conduct a preliminary experiment by prompting MapCoder, a multi-agent LLM+RAG framework, to synthesize programs for R-peak detection. We repeatedly generate 100 distinct versions of the programs and analyze them through code review and execution. Surprisingly, we find that only 28% of the programs adopt appropriate IoT-related algorithms to perform R-peak detection. This is because LLM+RAG requires sophisticated RAG designs and user prompts; otherwise, the retrieved knowledge is less accurate and relevant to the IoT context, and LLMs may fail to focus on the IoT domain, still providing simple and general solutions. Moreover, the cascading process inevitably introduces noise and propagates errors, leading to a long self-recovery time. More importantly, cloud LLMs suffer from unstable network conditions, high costs, and privacy concerns.

To overcome these challenges, GPIoT fine-tunes SLMs on IoT-specialized text-generation datasets, as SLMs have smaller sizes and can be locally deployed. Additionally, by steering the parameter distribution towards the IoT domain via tuning, SLMs can focus on IoT-related semantic context, generating highly relevant responses that follow pre-defined formats with enhanced stability.

System Overview

Fig. 3 illustrates the architecture of GPIoT, consisting of an offline tuning stage and an online processing stage.

  • Offline Stage. The offline tuning stage (the left part in Fig. 3) constructs two IoT-specialized datasets and fine-tunes the Task Decomposition SLM (TDSLM) and the Code Generation SLM (CGSLM), which are used for task decomposition and code generation in the online stage, respectively. We first build a RAG agent that extracts knowledge and programs from various IoT-related public sources (e.g., websites and articles) to construct high-quality datasets. Then, we augment the datasets with our IoT-oriented augmentation method to enhance their quantity, quality, and diversity. Note that the RAG agent is only used for high-quality dataset construction during the offline stage. With the two augmented datasets, we fine-tune the two SLMs via our PECT paradigm, where certain model parameters are collaboratively tuned through a multi-path LoRA pipeline with two projection layers, one for task decomposition and one for code generation. Our PECT paradigm mitigates the domain misalignment between TDSLM and CGSLM by facilitating knowledge transfer and sharing.
  • Online Stage. The online stage (the right part in Fig. 3) synthesizes IoT-specific programs based on the user's requirements for IoT application development. Specifically, GPIoT first leverages the Task Decomposition SLM (TDSLM) to decompose the IoT application into multiple manageable sub-tasks with detailed descriptions (①∼②). Next, through CoT-based prompting techniques, the sub-task descriptions are gradually transformed into well-structured specifications by the Requirement Transformation SLM (RTSLM) (③∼④). Then, for each sub-task, the Code Generation SLM (CGSLM) generates a code snippet with accompanying documentation (⑤∼⑥). Users can execute the code sequentially, following the instructions in the documentation, to realize the IoT application (⑦).

Figure 3: System overview.
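
The PECT paradigm builds on LoRA-style parameter-efficient tuning, where a frozen base weight W is augmented with a trainable low-rank update scaled by α/r, and multiple adapter paths can share the same frozen base. Below is a minimal numpy sketch of one such adapted linear layer; the shapes, initialization, and class name are illustrative assumptions, not the actual PECT implementation.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen base weight and a low-rank LoRA update.

    Effective weight: W_eff = W + (alpha / r) * B @ A, where only A and B
    are trained. In a multi-path pipeline, each path would keep its own
    (A, B) pair over the same shared frozen W.
    """

    def __init__(self, w, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                       # frozen base weight, shape (out, in)
        out_dim, in_dim = w.shape
        self.a = rng.standard_normal((r, in_dim)) * 0.01  # trainable
        self.b = np.zeros((out_dim, r))                   # trainable, init 0
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus scaled low-rank path. Because B starts at zero,
        # the adapter is initially a no-op (output equals the base output).
        return x @ self.w.T + self.scale * (x @ self.a.T @ self.b.T)
```

The zero-initialized B matrix is the standard LoRA trick that lets tuning start exactly from the pretrained model's behavior.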

Evaluation

We construct a benchmark, IoTBench, to evaluate the overall performance of GPIoT on various IoT applications.

IoTBench

To evaluate LLMs' abilities in task decomposition and code generation for IoT applications, we create IoTBench, a benchmark of text-generation tasks in the IoT domain. Specifically, we choose 100 samples from TDD and CGD with manually created test cases, covering various IoT topics (e.g., signal processing and edge AI). All the selected data samples are first manually filtered to ensure correctness and relevance to the IoT domain. Then, we format each decomposition so that consecutive sub-tasks are separated by a blank line. Note that although many SOTA benchmarks (e.g., HumanEval) can also evaluate LLMs' code generation abilities, they are not tailored to IoT tasks. Besides, the data in IoTBench is excluded from the tuning process to test the generalizability of the tuned SLMs.
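
The blank-line convention makes decomposition outputs mechanically parseable when scoring. A small sketch of such a parser (the exact parsing logic used by IoTBench is an assumption):

```python
import re

def split_subtasks(text):
    """Split a decomposition into sub-tasks, assuming consecutive
    sub-tasks are separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]
```

Splitting on `\n\s*\n` rather than a literal `"\n\n"` tolerates blank lines that contain stray whitespace, which language models frequently emit.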

1. Heartbeat Detection

As shown in Fig. 4, the code generated by GPIoT significantly outperforms all the baselines, with an average precision gain of 64.7% and an average RR increase of 16.9%. It is also worth noting that CodeLlama, WizardCoder, and CodeQwen achieve moderate RR (above 80%) but exhibit lower precision. Upon further analysis of the code, we find that they all adopt a simple peak detection function, scipy.signal.find_peaks(), which typically fails when handling abnormal ECG data from patients. As a result, the detection results contain numerous false positives, yielding low precision.
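
Precision in this setting hinges on matching each detected R-peak to a nearby ground-truth annotation, with each annotation usable at most once. A sketch of such tolerance-based matching follows; the 50 ms tolerance is an illustrative assumption, not necessarily the tolerance used in the evaluation.

```python
def peak_precision(detected, reference, fs=360, tol_s=0.05):
    """Precision of detected R-peaks: a detection is a true positive if it
    lies within tol_s seconds of a not-yet-matched reference peak.
    Indices are in samples; fs converts distances to seconds."""
    ref = list(reference)
    tp = 0
    for d in detected:
        if not ref:
            break
        dists = [abs(d - r) / fs for r in ref]
        j = dists.index(min(dists))
        if dists[j] <= tol_s:
            ref.pop(j)  # each reference peak can be matched only once
            tp += 1
    return tp / len(detected) if len(detected) else 0.0
```

Greedy one-to-one matching is what penalizes the false-positive-heavy outputs described above: every spurious detection lowers precision even when all true beats are also found.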

Figure 4: The overall performance of heartbeat detection.

2. Human Activity Recognition

In this evaluation, for all the generated HAR models, we set the training epochs to 10 and the batch size to 32 for a fair comparison. Besides, during our implementation, we find that after around 15 training epochs, all the models gradually converge. Therefore, we compare the model performance at the 10th epoch. As shown in Fig. 5, the program synthesized by GPIoT achieves a 17.2% higher accuracy with 47.8% less GPU memory and 38.3% shorter inference time on average.

Figure 5: The overall performance of human activity recognition.

3. Multimodal HAR

We further instruct GPIoT to synthesize programs for the multimodal HAR application, aiming to evaluate its programming ability for more complex tasks. As shown in Fig. 6, compared with the baselines, the program synthesized by GPIoT achieves an average accuracy improvement of 13.44% while requiring moderate GPU memory and inference time. After reviewing the source code, we find that both GPIoT and the baselines train three encoders to first extract useful features from different modalities, followed by a classifier to recognize the corresponding activity. However, the program synthesized by GPIoT adopts some model optimization methods (e.g., quantization or pruning) and data augmentation methods tailored for IoT sensor data (e.g., time-frequency masking). As such, the synthesized program can train a memory-optimized model while maintaining high classification accuracy. These results indicate that, benefiting from our SLM tuning, the program synthesized by GPIoT can incorporate more IoT-specific data processing and model optimization algorithms, thereby achieving high performance even for multimodal HAR.
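
Time-frequency masking of the kind mentioned above zeroes out random bands of a sensor spectrogram so the model cannot over-rely on any single time span or frequency band. A SpecAugment-style numpy sketch (mask widths and the single-mask-per-axis design are illustrative assumptions, not the synthesized program's exact augmentation):

```python
import numpy as np

def time_freq_mask(spec, max_t=10, max_f=8, seed=None):
    """Apply one random time mask and one random frequency mask to a
    (freq_bins, time_steps) spectrogram, returning a masked copy."""
    rng = np.random.default_rng(seed)
    out = spec.copy()
    n_f, n_t = out.shape
    # Random start positions and widths for each mask (width >= 1).
    t0 = rng.integers(0, n_t - max_t + 1)
    t_w = rng.integers(1, max_t + 1)
    f0 = rng.integers(0, n_f - max_f + 1)
    f_w = rng.integers(1, max_f + 1)
    out[:, t0:t0 + t_w] = 0.0  # time mask: drop a band of time steps
    out[f0:f0 + f_w, :] = 0.0  # frequency mask: drop a band of bins
    return out
```

Applied on the fly during training, this kind of masking acts as a regularizer that is particularly useful for small IoT sensor datasets.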

Figure 6: The overall performance of multimodal HAR.

TDSLM Evaluation

We input each problem statement from IoTBench into TDSLM and the baselines to generate 20 different decomposed tasks and calculate the average BLEU score, FCR, and STC. From Fig. 7, we observe:

  • The decomposed tasks generated by TDSLM achieve a 48% higher BLEU score than the baselines on average, indicating a stronger decomposition ability for IoT tasks.
  • TDSLM achieves 99% FCR, indicating remarkable stability to generate intermediate output (decomposed tasks) based on pre-defined formats.
  • TDSLM also achieves a 28% higher STC on average, showcasing strong abilities in understanding IoT knowledge and generating comprehensive decomposed tasks for IoT applications.
Such superior IoT task decomposition and data formatting performance of TDSLM originates from the tuning process on TDD with our IoT-oriented text data augmentation method.
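
FCR can be operationalized as the fraction of generations whose decomposition parses under the pre-defined format. A sketch assuming a numbered, blank-line-separated sub-task format; the precise IoTBench format rule is an assumption for illustration.

```python
import re

def format_compliance_rate(outputs, pattern=r"^\d+\.\s"):
    """FCR sketch: fraction of generations whose decomposition follows
    the expected format. Here a generation is compliant when every
    blank-line-separated block starts with a numbered prefix like '1. '."""
    def compliant(text):
        blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
        return bool(blocks) and all(re.match(pattern, b.strip()) for b in blocks)
    return sum(compliant(o) for o in outputs) / len(outputs)
```

A strict checker like this is deliberately unforgiving: a single unnumbered block marks the whole generation non-compliant, mirroring how a downstream parser would fail on it.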

Figure 7: Breakdown evaluation on TDSLM.

CGSLM Evaluation

We input each task specification from IoTBench into CGSLM and the baselines to generate 20 different code snippets with documentation. We then report the average code embedding similarity, pass@1, pass@5, URC, and the number of issues detected by SonarQube by executing and reviewing the code. For comparison, we also report the pass@1 achieved by CodeLlama on a general-purpose programming benchmark, HumanEval, as shown in Fig. 8.
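
pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: given n generations per task of which c pass, it estimates the probability that at least one of k randomly drawn samples passes. A sketch, assuming that standard estimator is used here:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total generations per task, c: how many pass, k: samples drawn.
    If fewer than k generations fail, some draw must include a pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 20 generations per task (as above), pass@1 reduces to the empirical pass fraction c/n, while pass@5 rewards tasks where at least a few of the 20 generations succeed.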

Figure 8: Breakdown evaluation on CGSLM.

User Study

We conduct a user study to evaluate the functionality, generalizability, and overall satisfaction of GPIoT for IoT application development. Specifically, with GPIoT deployed on an edge server, we invite 5 experts and 15 non-experts in IoT and ask them to freely express their requirements for any IoT application development that requires signal processing or AI technologies. By sequentially executing the generated code based on the instructions in the documentation, we ask the users to rate GPIoT based on five metrics:

  • Overall Code Performance (OCP) evaluates the overall performance of the generated code on corresponding test data considering task accuracy, runtime efficiency, and resource consumption.
  • Code & Documentation Readability (CDR) measures the clarity and structure of the code and documentation.
  • Generation Efficiency (GE) assesses how efficiently GPIoT operates in terms of speed and resource usage to produce the final results.
  • Code Modularity (CM) judges whether the code is properly modularized for easy reuse and extension.
  • User Satisfaction (US) captures users' feedback regarding their overall personal experience.
All the above metrics are rated by the users on a scale from 1 (not at all) to 5 (completely). GitHub Copilot, DeepSeek-Coder (cloud), CodeLlama (local), and MapCoder (agent) serve as representative baselines for comparison. The results are shown in Fig. 9.

Figure 9: User study (Objective Evaluation).

Conclusion

We present GPIoT, a tailored local code generation system that synthesizes programs with documentation based on user requirements for IoT application development. Armed with two IoT-specialized text-generation datasets, the IoT-oriented augmentation method, and our PECT paradigm, GPIoT can generate more IoT-related code in a privacy-preserving way, achieving enhanced task accuracy and user satisfaction for IoT application development. As IoT technologies are emerging rapidly, it is also worthwhile to explore the construction of a dynamic IoT knowledge database and continuous fine-tuning of local SLMs in the future.

For more details, please refer to our Paper.

The code for implementing GPIoT is available here.

BibTeX

@inproceedings{shen2025gpiot,
  title={GPIoT: Tailoring Small Language Models for IoT Program Synthesis and Development},
  author={Shen, Leming and Yang, Qiang and Huang, Xinyu and Ma, Zijing and Zheng, Yuanqing},
  booktitle={Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems},
  pages={1--14},
  year={2025}
}