Search results - Inner Mongolia University Library

arXiv, 2025

Authors: Li, Jia; Zhu, Hao; Liu, Huanyu; Shi, Xianjie; Zong, He; Dong, Yihong; Zhang, Kechi; Jiang, Siyuan; Jin, Zhi; Li, Ge (Peking University, China; aiXcoder)

Repository-level code completion aims to complete code based on the long contexts of the repository. Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code. However, we reveal a severe limitation of LLMs, i.e., LLMs may ignore the information within long contexts in code completion. In other words, even when the contexts contain useful information (e.g., relevant APIs or similar code), LLMs may fail to utilize this information. We think this limitation is caused by an inherent bias in LLMs, i.e., relying on nearby contexts and ignoring long-range contexts. To address the above limitation, we propose a novel fine-tuning approach named CoLT. The core idea of CoLT is to provide explicit supervision signals, which emphasize that long-range contexts may hold relevant information. Specifically, CoLT proposes a reinforcement learning-based training, which explicitly encourages models to utilize the information within long contexts and punishes models for ignoring long contexts. To support CoLT, we release a large-scale repo-level code completion dataset, CoLT-132K. It covers four languages (e.g., Python and Java) and comprises 132,000 samples. Each sample takes a long context (up to 128K tokens) as input. We apply CoLT to a popular LLM, aixcoder-7B, and release aixcoder-7B-v2. We conduct extensive experiments on CoLT-132K and a public benchmark, CrossCodeEval. Our experiments yield the following results. (1) Effectiveness. CoLT substantially improves aixcoder-7B. aixcoder-7B-v2 outperforms aixcoder-7B by up to 44% in exact match. aixcoder-7B-v2 becomes the state-of-the-art 7B model in code completion and even surpasses larger models (e.g., DeepSeek-Coder-33B). (2) Generalizability. The capability learned by CoLT can generalize to new languages (i.e., languages not in training data). Besides, CoLT is model-agnostic and effectively improves multiple LLMs. (3) Enhanced Context Utilization Capability. CoLT significa

Keywords: Reinforcement learning
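
The abstract above reports gains in exact match. As a rough illustration of how such a score is usually computed, here is a minimal sketch. The function name `exact_match` and the choice to strip surrounding whitespace before comparing are assumptions made for this example; the record does not state how the authors normalize completions.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that equal their reference completion.

    Assumption (not from the record): leading and trailing whitespace is
    stripped before comparison; everything else must match exactly.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    preds = ["return a + b", "x = 1 "]
    refs = ["return a + b", "x = 2"]
    print(exact_match(preds, refs))
```

A relative gain such as "up to 44%" would then be the ratio of two such scores minus one, which is a different quantity from an absolute difference in percentage points.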

aixcoder-7B: A Lightweight and Effective Large Language Model for Code Processing

arXiv, 2024

Authors: Jiang, Siyuan; Li, Jia; Zong, He; Liu, Huanyu; Zhu, Hao; Hu, Shukai; Li, Erlu; Ding, Jiazheng; Han, Yu; Ning, Wei; Wang, Gen; Dong, Yihong; Zhang, Kechi; Li, Ge (aiXcoder, Beijing, China; Peking University, Beijing, China)

Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs have lower inference efficiency, affecting developers’ experience and productivity. In this paper, we propose a lightweight and effective LLM for code completion named aixcoder-7B. Compared to existing LLMs, aixcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aixcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aixcoder-7B. This vast volume of data enables aixcoder-7B to learn a broad code distribution. We evaluate aixcoder-7B on five popular code completion benchmarks and a new benchmark collected by this paper. The results show that aixcoder-7B outperforms the latest seven LLMs with similar sizes and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aixcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights for helping practitioners train the next generations of LLMs for code. aixcoder-7B has been open-sourced and gained significant attention [1]. As of January 2025, aixcoder-7B had received 2,226 GitHub Stars. Copyright © 2024, The Authors. All rights reserved.

Keywords: Benchmarking

SKCODER: A Sketch-based Approach for Automatic Code Generation

45th IEEE/ACM International Conference on Software Engineering (ICSE)

Authors: Li, Jia; Li, Yongmin; Li, Ge; Jin, Zhi; Hao, Yiyang; Hu, Xing (Peking Univ, MoE Key Lab High Confidence Software Technol, Beijing, Peoples R China; aiXcoder, Beijing, Peoples R China; Zhejiang Univ, Ningbo, Peoples R China)

ISBN: (print) 9781665457019

Recently, deep learning techniques have shown great success in automatic code generation. Inspired by the code reuse, some researchers propose copy-based approaches that can copy the content from similar code snippets to obtain better performance. Practically, human developers recognize the content in the similar code that is relevant to their needs, which can be viewed as a code sketch. The sketch is further edited to the desired code. However, existing copy-based approaches ignore the code sketches and tend to repeat the similar code without necessary modifications, which leads to generating wrong results. In this paper, we propose a sketch-based code generation approach named SKCODER to mimic developers' code reuse behavior. Given a natural language requirement, SKCODER retrieves a similar code snippet, extracts relevant parts as a code sketch, and edits the sketch into the desired code. Our motivations are that the extracted sketch provides a well-formed pattern for telling models "how to write". The post-editing further adds requirement-specific details into the sketch and outputs the complete code. We conduct experiments on two public datasets and a new dataset collected by this work. We compare our approach to 20 baselines using 5 widely used metrics. Experimental results show that (1) SKCODER can generate more correct programs, and outperforms the state-of-the-art - CodeT5-base by 30.30%, 35.39%, and 29.62% on three datasets. (2) Our approach is effective to multiple code generation models and improves them by up to 120.1% in Pass@1. (3) We investigate three plausible code sketches and discuss the importance of sketches. (4) We manually evaluate the generated code and prove the superiority of our SKCODER in three aspects.

Keywords: Code Generation; Deep Learning
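
The abstract above reports results in Pass@1. A minimal sketch of the widely used unbiased pass@k estimator follows, where n samples are generated per problem and c of them pass the tests. The record does not say which variant of the metric SKCODER uses, so applying this estimator here is an assumption, and the function name `pass_at_k` is chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: number of generated samples, c: number of samples that pass the tests,
    k: number of samples the user is allowed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the fraction of passing samples, c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is typically the mean of this per-problem estimate over all problems.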

AixBench: A Code Generation Benchmark Dataset

arXiv, 2022

Authors: Hao, Yiyang; Li, Ge; Liu, Yongqiang; Miao, Xiaowei; Zong, He; Jiang, Siyuan; Liu, Yang; Wei, He (Aixcoder, China; Peking University, China)

We present a benchmark dataset for evaluating the method-level code generation task. The benchmark contains a dataset of 175 samples for automated evaluation and a dataset of 161 samples for manual evaluation. We also present a new metric for automatically evaluating the correctness of the generated code, and a set of criteria for manually evaluating the overall quality of the generated code. © 2022, CC BY.

Keywords: Automation

FANformer: Improving Large Language Models Through Effective Periodicity Modeling

arXiv, 2025

Authors: Dong, Yihong; Li, Ge; Jiang, Xue; Tao, Yongding; Zhang, Kechi; Zhu, Hao; Liu, Huanyu; Ding, Jiazheng; Li, Jia; Deng, Jinliang; Mei, Hong (School of Computer Science, Peking University, China; aiXcoder; The Hong Kong University of Science and Technology, Hong Kong; Advanced Institute of Big Data)

Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs. Our code and pretrained model are available at https://***/YihongDong/FANformer. Copyright © 2025, The Authors. All rights reserved.

Keywords: Modeling languages

SKCODER: A Sketch-based Approach for Automatic Code Generation

arXiv, 2023

Authors: Li, Jia; Li, Yongmin; Li, Ge; Jin, Zhi; Hao, Yiyang; Hu, Xing (Key Lab of High Confidence Software Technology, MoE, Peking University, Beijing, China; aiXcoder, Beijing, China; Zhejiang University, Ningbo, China)

Recently, deep learning techniques have shown great success in automatic code generation. Inspired by the code reuse, some researchers propose copy-based approaches that can copy the content from similar code snippets to obtain better performance. Practically, human developers recognize the content in the similar code that is relevant to their needs, which can be viewed as a code sketch. The sketch is further edited to the desired code. However, existing copy-based approaches ignore the code sketches and tend to repeat the similar code without necessary modifications, which leads to generating wrong results. In this paper, we propose a sketch-based code generation approach named SKCODER to mimic developers’ code reuse behavior. Given a natural language requirement, SKCODER retrieves a similar code snippet, extracts relevant parts as a code sketch, and edits the sketch into the desired code. Our motivations are that the extracted sketch provides a well-formed pattern for telling models "how to write". The post-editing further adds requirement-specific details into the sketch and outputs the complete code. We conduct experiments on two public datasets and a new dataset collected by this work. We compare our approach to 20 baselines using 5 widely used metrics. Experimental results show that (1) SKCODER can generate more correct programs, and outperforms the state-of-the-art, CodeT5-base, by 30.30%, 35.39%, and 29.62% on three datasets. (2) Our approach is effective to multiple code generation models and improves them by up to 120.1% in Pass@1. (3) We investigate three plausible code sketches and discuss the importance of sketches. (4) We manually evaluate the generated code and prove the superiority of our SKCODER in three aspects. Copyright © 2023, The Authors. All rights reserved.

Keywords: Deep learning

SkCoder: A Sketch-based Approach for Automatic Code Generation

International Conference on Software Engineering (ICSE)

Authors: Jia Li; Yongmin Li; Ge Li; Zhi Jin; Yiyang Hao; Xing Hu (Key Lab of High Confidence Software Technology, MoE (Peking University), Beijing, China; aiXcoder, Beijing, China; Zhejiang University, Ningbo, China)

Recently, deep learning techniques have shown great success in automatic code generation. Inspired by the code reuse, some researchers propose copy-based approaches that can copy the content from similar code snippets to obtain better performance. Practically, human developers recognize the content in the similar code that is relevant to their needs, which can be viewed as a code sketch. The sketch is further edited to the desired code. However, existing copy-based approaches ignore the code sketches and tend to repeat the similar code without necessary modifications, which leads to generating wrong results. In this paper, we propose a sketch-based code generation approach named SkCoder to mimic developers' code reuse behavior. Given a natural language requirement, SkCoder retrieves a similar code snippet, extracts relevant parts as a code sketch, and edits the sketch into the desired code. Our motivations are that the extracted sketch provides a well-formed pattern for telling models “how to write”. The post-editing further adds requirement-specific details into the sketch and outputs the complete code. We conduct experiments on two public datasets and a new dataset collected by this work. We compare our approach to 20 baselines using 5 widely used metrics. Experimental results show that (1) SkCoder can generate more correct programs, and outperforms the state-of-the-art, CodeT5-base, by 30.30%, 35.39%, and 29.62% on three datasets. (2) Our approach is effective to multiple code generation models and improves them by up to 120.1% in Pass@1. (3) We investigate three plausible code sketches and discuss the importance of sketches. (4) We manually evaluate the generated code and prove the superiority of our SkCoder in three aspects.
