It’s too soon to trust Microsoft’s GitHub Copilot to automatically fix your programming code. Microsoft itself has said that the program, sold as a $10 per month add-on to GitHub, “does not write perfect code,” and “may contain insecure coding patterns, bugs, or references to outdated APIs or idioms.”
The dream of automation, however, suggests that someday, artificial intelligence will predict a fault in a program that can break functionality, or bring systems down, and not only warn a developer before the code goes into production, but also tell them how to alter code to avert the problem. AI might even be able to reach into the application code and automatically fix it for the programmer, saving them significant effort.
The makings of such a future can be seen in today’s tools for DevOps and observability. DevOps tool maker Dynatrace has for a number of years been building what it calls “causal AI,” and “predictive AI,” to identify why programs go down, and to predict how they’ll fail.
The next stage is wrapping generative AI around those observability tools to give coders suggestions as to how their code is going to run into trouble and how to alleviate it.
“The typical request from a CIO is, please fix my system before it actually fails,” said Bernd Greifeneder, chief technology officer and co-founder of Dynatrace, in an interview with ZDNET. Dynatrace is a commercial software vendor in the DevOps and Observability market that sells tools for application lifecycle management.
Consider an everyday systems pitfall: running out of disk space in Amazon’s AWS.
“It’s totally ironic,” noted Greifeneder. “Even in these days of super-high tech, it is a problem that cloud disks somewhere at AWS run out of disk space, and we have to trigger API calls in order to resize them. We don’t want to resize them [the disks] up-front because it’s costly, so we want to optimize what we use, but the usage patterns can change depending on how many customers we have in our clusters and so forth.”
What’s needed is to create code that will spring into action when an out-of-disk error looks likely based on past performance.
To tackle the problem, the company first identifies a “root cause” of a disk failure with the combination of causal and predictive AI. These two tools are not based on large language models or other generative AI. Instead, they rely on older, more well-established forms of artificial intelligence that can be counted on to produce rigorous, consistent results.
In the case of causal AI, the program employs several algorithms including quantile regression, density estimation, and what’s known as a random surfing model. Unlike neural nets that are trained on a static set of data to detect correlations between data points, the causal programs are used to traverse a graph representing elements of a company’s IT system and their relationships.
“Typical statistical models or neural network type of learning models do not work for dynamic IT systems in the bigger scope,” said Greifeneder, because variables change too much. “Our customers may have tens of thousands to hundreds of thousands of pods, and many of them are interconnected, and change while traffic is routed, and things scale, and there are different versions, etc.”
To build what Greifeneder calls an “in-memory, real-time model” of a customer’s entire IT system, the causal AI programs construct a “multi-dimensional model that has the causal, directed dependency, sort of like a multidimensional graph” of all the entities — from what cloud service it is to what version of Kubernetes is being used to what app is running. That model, called Smartscape, is consulted whenever there is a system issue that raises alarms, “inferring the root cause based on traversing that Smartscape model.”
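To make the idea of traversing a dependency graph concrete, here is a minimal sketch in Python. It is not Dynatrace's algorithm: the entity names, the DEPENDS_ON map, and the root_causes function are all hypothetical, and the rule used (an unhealthy entity whose own dependencies are all healthy is a root-cause candidate) is a simplifying assumption for illustration.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "checkout-app": ["payment-service", "cart-service"],
    "payment-service": ["postgres-pod"],
    "cart-service": ["redis-pod"],
    "postgres-pod": ["ebs-volume-1"],
    "redis-pod": [],
    "ebs-volume-1": [],
}


def root_causes(graph, start, unhealthy):
    """Walk from an alarming entity along its unhealthy dependencies.

    An entity is reported as a root-cause candidate when it is unhealthy
    and none of its direct dependencies are unhealthy (assumed rule).
    """
    if start not in unhealthy:
        return []
    seen = {start}
    queue = deque([start])
    causes = []
    while queue:
        node = queue.popleft()
        bad_deps = [dep for dep in graph.get(node, []) if dep in unhealthy]
        if not bad_deps:
            # Nothing further down explains the problem, so stop here.
            causes.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return causes


if __name__ == "__main__":
    alarms = {"checkout-app", "payment-service", "postgres-pod", "ebs-volume-1"}
    print(root_causes(DEPENDS_ON, "checkout-app", alarms))
```

In this toy topology, an alarm on the checkout application is traced down through the payment service and database pod to the disk volume, while the healthy cart service branch is ignored.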
That causal model won’t anticipate variations in the business, however. “It knows the root cause” of things, “but what it does not know, is, what is your business pattern,” said Greifeneder, citing patterns such as, “Monday morning at 8:00 AM, you have a big spike in usage for whatever reason.”
For such aberrations, “there needs to be some form of history-based learning,” said Greifeneder.
To achieve that historical learning, a predictive AI component uses another set of well-developed tools, such as an autoregressive integrated moving average (ARIMA) model, an algorithm that is particularly attuned to piecing together patterns occurring in data over time.
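For readers unfamiliar with the technique, the following Python sketch shows the simplest useful member of that model family: a first-order autoregression on first differences, with no moving-average term and no constant, often written ARIMA(1,1,0). The function names are hypothetical, the coefficient is estimated by plain least squares, and nothing here is claimed to match how Dynatrace configures its predictive component.

```python
def fit_ar1(diffs):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + noise."""
    num = sum(cur * prev for cur, prev in zip(diffs[1:], diffs[:-1]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den if den else 0.0


def forecast_arima_110(series, steps):
    """Forecast `steps` future values with a no-constant ARIMA(1,1,0)."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    # "Integrated" part (d=1): model the changes, not the raw levels.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # "Autoregressive" part (p=1): each change depends on the previous one.
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecast = []
    for _ in range(steps):
        last_diff = phi * last_diff  # predicted next change
        level += last_diff           # undo the differencing
        forecast.append(level)
    return forecast


if __name__ == "__main__":
    daily_usage_gb = [10, 12, 14, 16, 18]
    print(forecast_arima_110(daily_usage_gb, 3))
```

A production forecaster would typically also handle seasonality, such as the Monday-morning spike Greifeneder describes, which this sketch deliberately leaves out.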
Crucially, predictive AI does not look only at back-end systems, such as the server. It also receives signals from endpoints in a network, things such as how the end user is experiencing lag or interrupted service.
“Looking at server-side systems alone is not good enough,” said Greifeneder. “Real user monitoring, for instance, or API service monitoring, is an important aspect in understanding the dependencies.”
While a CIO cares most about systems, user issues may crop up even when servers are running fine, so back-end and user experience both need to be measured and compared.
“Sometimes we meet the IT-only person who cares only about their servers — ‘Oh, my server is up’ — but, actually, users are frustrated,” he said. “The opposite exists: Just because one of those CPUs goes wild, it doesn’t mean the end user is impacted.”
Returning to the disk space example, the causal and the predictive AI can anticipate a future disk issue. “We can extrapolate from the past days and weeks of usage in the cluster to see, ‘Oh, we run the risk that in a week from now, we might run out of disk space,’ ” said Greifeneder.
That is the impetus to take proactive steps, such as, “Let’s trigger now a workflow action from Dynatrace’s automation engine to call an API into AWS to resize the disk and therefore automatically prevent an outage that we had in the past because of this.”
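The extrapolate-then-act pattern Greifeneder describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dynatrace's workflow engine: the function names, the seven-day horizon, the 1.5x growth factor, and the resize callback are all hypothetical, and the callback stands in for whatever cloud API call would actually resize the volume.

```python
def days_until_full(usage_gb, capacity_gb):
    """Fit a least-squares line to daily usage and extrapolate to capacity.

    Returns None when usage is flat or shrinking.
    """
    n = len(usage_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(usage_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_gb))
    slope = sxy / sxx  # estimated growth in GB per day
    if slope <= 0:
        return None
    fitted_today = mean_y + slope * ((n - 1) - mean_x)
    return max(0.0, (capacity_gb - fitted_today) / slope)


def maybe_resize(usage_gb, capacity_gb, resize, horizon_days=7, growth=1.5):
    """Call `resize(new_size_gb)` if the disk is projected to fill soon."""
    remaining = days_until_full(usage_gb, capacity_gb)
    if remaining is not None and remaining <= horizon_days:
        new_size = int(capacity_gb * growth)
        resize(new_size)  # hypothetical hook for the real cloud API call
        return new_size
    return None


if __name__ == "__main__":
    history = [70, 74, 78, 82, 86]  # made-up daily usage in GB
    maybe_resize(history, 100, lambda gb: print(f"resize volume to {gb} GB"))
```

Resizing only when the projection crosses the horizon reflects the trade-off in the interview: provisioning large disks up front is costly, so capacity is added just ahead of need.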
It’s here that generative AI gets looped into the process. The Dynatrace umbrella program, Davis AI, this year added a component called Davis CoPilot that rides on top of the causal and predictive systems.
A user can type to the CoPilot, “create me an automation that actually prevents this [disk outage proactively].” The CoPilot can send an inquiry to the causal and predictive AI to ask what disks are being referred to in that prompt. In response, the Davis program uses the Smartscape and the predictive information to create a prompt with all the contextual details that are required to understand the IT system in its current state.
That prompt is then sent to the CoPilot, which, once given the details, “will give you back the template of the workflow to automate” the disk re-sizing, explained Greifeneder. “It will give you, as the user, the ability to review and say, OK, this is approximately right, thank you, you helped me get 90% there,” which can save the systems engineer time versus building a workflow from scratch.
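The grounding step, in which system facts are packed into the prompt before the language model sees the request, might look roughly like the Python sketch below. The function name, the prompt wording, and the shape of the topology and forecast data are assumptions made for illustration; the article does not describe the actual prompt format, and no model is called here.

```python
import json


def build_grounded_prompt(user_request, topology, forecast):
    """Combine a user's request with system facts into a single prompt.

    `topology` and `forecast` are hypothetical stand-ins for what the
    causal and predictive components would supply.
    """
    context = {"topology": topology, "forecast": forecast}
    return (
        "You are drafting an automation workflow template.\n"
        "Use only the system facts in CONTEXT; do not invent resources.\n"
        f"CONTEXT:\n{json.dumps(context, indent=2, sort_keys=True)}\n"
        f"REQUEST:\n{user_request}\n"
    )


if __name__ == "__main__":
    prompt = build_grounded_prompt(
        "Create an automation that prevents this disk outage proactively.",
        {"volume": "ebs-volume-1", "attached_to": "postgres-pod"},
        {"volume": "ebs-volume-1", "days_until_full": 3.5},
    )
    print(prompt)
```

The design point is that the model is asked to fill in a template around facts it is given, rather than to guess which disks or clusters the user means.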
The next step is for the Davis AI program to bring all these observations back to the programmer at the time they are first coding the application. The holy grail of application development is to prevent coding that causes faults before that application is put into production rather than having to fix things later.
One approach is what Dynatrace calls a guardian. A DevOps individual can ask the CoPilot in natural language to create a guardian to watch over a particular application performance goal before that application is put into production. The company terms this “defining a quality objective in the code.” The causal and predictive elements are then used to verify whether or not the code will meet the objectives that have been defined.
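A quality objective of this kind can be thought of as a named threshold that a build must satisfy before release. The Python sketch below shows the general shape of such a gate; the Objective class, the metric names, and the limits are hypothetical, and it compares already-measured values rather than reproducing the causal and predictive verification the company describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    metric: str
    limit: float
    higher_is_better: bool = False  # e.g. True for a success rate


def check_objectives(objectives, measured):
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = []
    for obj in objectives:
        value = measured.get(obj.metric)
        if value is None:
            # A missing measurement is treated as a failure, not a pass.
            failures.append(f"{obj.metric}: no measurement")
        elif obj.higher_is_better and value < obj.limit:
            failures.append(f"{obj.metric}: {value} is below {obj.limit}")
        elif not obj.higher_is_better and value > obj.limit:
            failures.append(f"{obj.metric}: {value} is above {obj.limit}")
    return failures


if __name__ == "__main__":
    gate = [
        Objective("p95_latency_ms", 300),
        Objective("success_rate", 0.99, higher_is_better=True),
    ]
    print(check_objectives(gate, {"p95_latency_ms": 420, "success_rate": 0.995}))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when data is absent would defeat the purpose of checking before production.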
Of course, if Davis AI flags potentially problematic code, the issue is then how to fix it. It is possible to have the Davis CoPilot advise the programmer on possible code fixes, though that is still an emerging area.
“We are thinking about, with this Davis CoPilot, providing recommendations on how we identified this vulnerability in production based on your technology stack, and Davis CoPilot provides you these recommendations that you should check out to fix in your code,” Greifeneder told ZDNET.
It’s still early in the use of generative AI for those kinds of code fix recommendations, said Greifeneder. While the causal-predictive AI is engineered to be reliable, the generative algorithms still suffer from the phenomenon of “hallucinations,” meaning that the program confidently asserts inaccurate information.
“What is reliable is what comes from the causal AI because that is the accurate system state,” he said. “So, we know exactly what’s there; what is not reliable is the potential recommendation on how to change the code because this comes from the public GPT-4 models.”
Therefore, code suggestions for remediation may start from a valid premise, but they run into the same issue as GitHub Copilot: not having a really rigorous sense of what code is appropriate. There’s a need to integrate large language models more closely with the tools that Dynatrace and others provide, to give some grounding to generative AI’s suggestions.
Formal studies of GPT-4 and its ilk report very mixed results in finding and fixing code vulnerabilities. The technical paper released by OpenAI with the introduction of GPT-4 in March cautioned against relying on the program. GPT-4, it said, “[…] performed poorly at building exploits for the vulnerabilities that were identified.”
A study in February of GPT-4’s predecessor, GPT-3, by University of Pennsylvania researcher Chris Koch, was encouraging. It showed that GPT-3 was able to find 213 vulnerabilities in a collection of GitHub repository files curated for their known vulnerabilities. That number was well above the 99 errors found by a popular code evaluation tool named Snyk, a form of “Static Application Security Testing,” or SAST, commonly used to test for software vulnerabilities.
But, noted Koch, both GPT-3 and Snyk missed a lot of vulnerabilities — they had a lot of “false negatives,” as they’re known.
A subsequent study, by cybersecurity firm PeopleTec, built upon Koch’s work, testing an updated GPT-4 released in August. It found that GPT-4 uncovered four times as many vulnerabilities in the same files.
However, in both studies, the models were tested on files representing a grand total of just over 2,000 lines of code. That is minuscule compared to full production applications, which can contain hundreds of thousands to millions of lines of code, across numerous linked files. It’s not clear that successes on the toy problems of the GitHub files will scale to such complexity.
The race is underway to try and amplify language models for that greater task. In addition to Dynatrace, privately held Snyk Ltd. of the UK, which sells a commercial version of the open-source tool, offers what it calls “DeepCode AI.” That technology, Snyk claimed, can avoid the stumbles of generative AI by integrating it with other tools. “DeepCode AI’s hybrid approach uses multiple models and security-specific training sets for one purpose — to secure applications,” the company stated.
It’s clear that generative AI has a ways to go to handle even simple kinds of debugging and code fixing, leaving aside the complexity of a live production IT environment. The great shift left via AI is not here yet.
What is on the horizon, with Davis CoPilot and efforts like it, is using generative AI as a new interface to help coders examine their own code more aggressively both before and after they ship that code.