Find the bug – what AI can really do in software engineering

Can language models such as GPT really understand what a ‘bug’ is? This is one of the questions being investigated by a research team at the University of Passau in a project funded by the DFG.

Language models such as GPT and BERT have developed impressive capabilities in recent years. They translate, summarise, write code and even compose poetry. But do these models actually understand what they are saying, or are they simply repeating patterns from huge amounts of text? Large AI models tend to struggle in specialist areas such as software engineering, where many terms have multiple meanings.

‘In our research, we were able to show that the systems have problems with ambiguous terms. “Bug” and “root”, for example, have completely different meanings in computer science than in botany,’ explains Professor Steffen Herbold, Chair of AI Engineering at the University of Passau. In the DFG project ‘SENLP – Knowledge about Software Engineering in NLP Models’, researchers led by Professor Herbold are taking a closer look at how reliably language models handle specialist knowledge from software development and how their limitations can be better understood.

How AI models deal with nonsense

To this end, the researchers are putting the knowledge of large language models to the test – literally. ‘We treat the AI models like exam candidates and test whether they can answer technical questions,’ explains Professor Herbold. For example, can they recognise a correct definition in a multiple-choice test? Can they correctly distinguish between similar concepts and explain the differences?
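The exam metaphor can be sketched in code. The snippet below is a minimal, hypothetical illustration of multiple-choice probing, not the project's actual test harness: `toy_scorer` is a stand-in for a real model's answer-scoring function (for instance, the log-likelihood a language model assigns to each option), and the sample question is invented for demonstration.

```python
# Hypothetical sketch: treating a language model like an exam candidate
# on multiple-choice questions. `score_option` stands in for a real
# model call; `toy_scorer` is a self-contained stub.
from typing import Callable

MCQuestion = dict  # {"question": str, "options": list[str], "answer": int}

def pick_answer(q: MCQuestion, score_option: Callable[[str, str], float]) -> int:
    """Return the index of the option the model scores highest."""
    scores = [score_option(q["question"], opt) for opt in q["options"]]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(questions: list, score_option) -> float:
    """Fraction of questions where the top-scored option is the correct one."""
    correct = sum(pick_answer(q, score_option) == q["answer"] for q in questions)
    return correct / len(questions)

def toy_scorer(question: str, option: str) -> float:
    # Stub: score an option by overlap with software-engineering keywords.
    keywords = {"defect", "fault", "software"}
    return float(sum(word in option.lower() for word in keywords))

exam = [
    {
        "question": "In software engineering, what is a 'bug'?",
        "options": [
            "An insect of the order Hemiptera",
            "A defect or fault in a software system",
            "A plant root",
        ],
        "answer": 1,
    },
]

print(accuracy(exam, toy_scorer))  # 1.0 on this toy exam
```

With a real model plugged in as `score_option`, the same loop yields an accuracy figure per model and per question category, which is the kind of comparable measurement an exam-style evaluation is after.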

Another topic is dealing with nonsense, which is often referred to as model hallucinations. ‘It is well known that large language models generate nonsense. We are looking at how pronounced the problem is in software development.’ To this end, the researchers are not only investigating the extent to which the models themselves generate nonsense. They are also testing how the systems react when they receive a nonsensical input. ‘We want to know whether the models can recognise nonsense.’
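The second test described above can also be sketched: corrupt a sensible question into a nonsensical one and check whether the model pushes back on the premise. Everything below is a hypothetical illustration; `model_reply` is a stub in place of a real LLM call, and the keyword check is a drastic simplification of how a reply would actually be judged.

```python
# Hypothetical sketch: does a model recognise a nonsensical input?
def make_nonsense(question: str) -> str:
    """Corrupt a sensible question by swapping in an out-of-domain term."""
    return question.replace("bug", "chlorophyll")

def flags_nonsense(reply: str) -> bool:
    """Very rough check: does the reply push back on the premise?"""
    markers = ("does not make sense", "nonsensical", "not a meaningful")
    return any(m in reply.lower() for m in markers)

def model_reply(prompt: str) -> str:
    # Stub in place of a real model call.
    if "chlorophyll" in prompt:
        return "This question does not make sense in software engineering."
    return "A bug is a defect in a software system."

sensible = "How do you fix a bug in a program?"
nonsense = make_nonsense(sensible)

print(flags_nonsense(model_reply(sensible)))  # False
print(flags_nonsense(model_reply(nonsense)))  # True
```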

Language models in comparison

The researchers systematically evaluate the responses and compare them across model architectures: How do smaller, specialised models with a so-called encoder-only architecture, such as BERT, compare with large decoder-only models such as GPT?
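The architectural difference shapes how the same knowledge probe has to be posed. As a hedged illustration (the probe texts are invented, not taken from the project): an encoder-only model such as BERT is typically queried with a cloze-style masked slot, while a decoder-only model such as GPT is given a prompt to continue.

```python
# Hypothetical sketch: the same knowledge probe framed for the two
# architecture families the text compares.
def cloze_probe(term: str, definition_stub: str) -> str:
    """Masked-token probe, as used with encoder-only models like BERT."""
    return f"In software engineering, a {term} is {definition_stub} [MASK]."

def completion_probe(term: str) -> str:
    """Next-token probe, as used with decoder-only models like GPT."""
    return f"In software engineering, a {term} is"

print(cloze_probe("bug", "a defect in a"))
# In software engineering, a bug is a defect in a [MASK].
print(completion_probe("bug"))
# In software engineering, a bug is
```

Keeping the probe content identical while only the framing changes is what makes scores from the two model families comparable in the first place.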

The Passau team wants to find out whether models trained on general text corpora can still internalise specialist knowledge from software engineering, or whether domain-specific pre-training is absolutely necessary. Based on these findings, the researchers are developing methodological foundations for testing and improving specialist knowledge in large AI models in a more targeted manner in the future.

The Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) is funding the project for a period of three years.

This text was machine-translated from German.

Principal Investigator(s) at the University: Prof. Dr. Steffen Herbold (Chair of AI Engineering)
Project period: 01.04.2024 - 31.03.2027
Source of funding: DFG - Deutsche Forschungsgemeinschaft > DFG - Sachbeihilfe (research grant)
Project number: 524228075
Funding acknowledgement:

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation). 