Devoured - April 27, 2026
Your AI Might Be Lying to Your Boss (22 minute read)

Investigation reveals AI coding assistants systematically overreport their code contribution by massive margins due to measurement biases that don't count pasted text, auto-completed symbols, or refactored code as human work.

What: A technical investigation into how AI coding tools like Windsurf and Cursor measure their code contribution. The author reverse-engineered Windsurf's metrics system and found it can report 98% AI-generated code even when developers write most code manually, primarily because pasted code and editor auto-completions don't count as human contributions while everything the AI touches does.
Why it matters: These inflated metrics serve vendors' financial interests and could lead to unrealistic productivity expectations from management, incorrect team sizing decisions, and potential legal issues since AI-generated code isn't copyrightable under current law.
Takeaway: Be skeptical of vendor-provided AI contribution percentages and understand they're optimized to maximize reported AI usage rather than reflect actual productivity gains or code authorship.
Deep dive
  • Author noticed Windsurf dashboard claiming 98% of their code was AI-generated despite minimal perceived AI usage, prompting investigation into how the metric works
  • Windsurf claims 85-95%+ AI contribution is normal and "accurate given how we compute this metric" but the methodology has severe biases
  • Reverse-engineered the system by inspecting network traffic and web API responses to extract underlying byte counts behind the percentage
  • Found Windsurf doesn't count auto-closing symbols (parentheses, quotes) added by VSCode as human-written but does count them when AI generates them
  • Any pasted text doesn't count toward human contribution at all, creating absurd scenarios
  • When refactoring code by cut/paste, human bytes are deducted but pasting doesn't add them back; when AI moves the same code, it counts as AI-written
  • In a controlled test writing identical files manually versus with AI, Windsurf reported 68% AI-generated despite a true 50/50 split
  • Moving functions via cut/paste versus asking AI to move them resulted in 100% AI attribution even though developer wrote everything
  • Windsurf's documentation claims measurement happens "at commit time" but testing showed real-time tracking that loses history on editor restart
  • Tested competing product Cursor which uses git commit signatures and line-based attribution instead of byte tracking
  • Cursor performed better in basic tests but still significantly overcounted - marked an entire 100-line file as AI-written when a quote-style change touched only 49 of its 93 non-blank lines
  • Both tools consistently bias toward inflating AI percentages, likely because high numbers benefit vendors' marketing and justify subscription costs
  • Metrics could create real problems: unrealistic productivity expectations from management, team downsizing decisions, or copyright concerns since AI code isn't copyrightable
  • Fundamental challenge is that measuring AI contribution is genuinely difficult - best use cases may not generate any code at all but answer architectural questions
  • Lines of code has always been a poor productivity metric for humans and remains flawed for measuring AI contribution
  • Author concludes vendors have too much financial stake in impressive numbers to provide objective measurements of their tools' impact
Decoder
  • PCW (Percent Code Written): Windsurf's metric claiming to show what percentage of code was written by AI versus manually
  • Protobuf (Protocol Buffers): Google's binary data serialization format that encodes data without human-readable field labels, making network traffic harder to inspect
  • Cascade: Windsurf's AI agent chatbox where developers can ask questions or request code generation
  • Composer: Cursor's AI code generation feature
  • FedRAMP/HIPAA: Federal security and healthcare compliance certifications that some enterprise customers require
Original article

This post is my personal opinion based on my testing and observations. I'm pretty confident in my test methodology, but William O'Connell is human and can make mistakes, check important info, etc.

How much of your code is AI? That question would've been gibberish to me five years ago, but of course the last few years have seen an explosion of "AI-enhanced" IDEs and other software development tools. Software companies are spending huge sums of money to provide these tools to their staff, and rapidly cycling through them as the space continues to evolve.

I don't make heavy use of any of these in my personal life, but I have gotten to try a handful of them through various employers. One such tool is Windsurf, a VSCode fork that most people assume shut down after Google bought out its key leadership last year. It didn't though, at least not yet, and I'd imagine its FedRAMP and HIPAA certifications will continue to make it appealing to certain types of enterprise customers for the foreseeable future. If you've seen Cursor or GitHub Copilot, it's basically the same, with some AI-powered autocomplete features and an "agent" chatbox called Cascade where you can ask your favorite LLM why a bug is happening, or get it to draft a class or function for you. In theory these types of agents can develop features and even whole applications on their own, but in my experience the results are pretty inconsistent, so I tend to stick to simpler requests.

Screenshot of Windsurf, which is a code editor based on Visual Studio Code. On the left is a file browser, in the center is some code, and on the right is a chat window. I have asked Cascade to help fix a bug and it has identified that the problem is on line 39 of a file called DataFetchManager.svelte.ts. The suggested change is shown in the center of the screen with a green/red diff.

It really is amazing how fast an LLM can sometimes track down a bug just from a description.

One thing that's very important to any enterprise rolling out a tool like this is metrics:

  • Are employees using it?
  • How much time is it saving?
  • Is this technology being used to paper over inefficiencies in our existing processes, obscuring underlying issues because using AI to quickly produce documents that won't be read and code that won't be run is easier than asking why those things are being done in the first place?

Admittedly I haven't heard that last one much, but the first two definitely get asked a lot. To help with this, Windsurf offers a dashboard of analytics at both the individual and team level. It includes things like the number of autocomplete suggestions accepted, the number of messages sent to Cascade, and which models are being used the most. It also includes a metric called "% new code written by Windsurf" (or sometimes "PCW"), which they seem quite proud of, since it gets top billing on the dashboard and they wrote a whole blog post explaining it.

The pitch is pretty simple: how much of the code did a developer write by hand, and how much did they generate with AI? When I first learned about this feature my guess would have been 10, maybe 20% AI, depending on the project and whether you include unit tests (LLMs are pretty good at those). So you can imagine my surprise when I opened the dashboard and saw this:

Screenshot from the Windsurf dashboard, showing the "% new code written by Windsurf" metric at 98%.

Don't worry employers, I didn't screenshot my work computer. This is a recreation.

Now, it's certainly possible to misjudge how often you use a particular tool. If the number had been 40%, or even 50%, I wouldn't have been that shocked. But 98%? That would mean I'm generating forty-nine times as much code as I'm writing manually. If that were true wouldn't I have run through my token budget by now? Shouldn't I either have been promoted for my godlike productivity, or fired because 49/50 of all developers are now redundant? You'd think, but Windsurf says this result is pretty normal:

"...customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric, though there are a number of caveats that we will cover later in this section."

"Hallucination" is an amusing choice of word there, since it implies the metric itself is generated by some sort of machine learning system, which seems unlikely. But regardless, if those numbers are "accurate given how we compute this metric", how exactly do they compute it? To their credit, they go into a fair bit of detail:

"To compute PCW, we take the number of new, persisted bytes of code that can be attributed to an accepted AI result from Windsurf (i.e. Tab suggestion, Command generation, or Cascade edit) and the number of new, persisted bytes of code that can be attributed to the developer manually typing. ... We take these measurements whenever a commit is being made. This way if the AI added a lot of code but the developer deleted a lot of it before committing the code to the codebase, then we are not incorrectly inflating the W number. Similarly, any bytes of code that come from the developer manually editing an AI result will get attributed to the developer (D) as opposed to Windsurf."

That all sounds pretty reasonable, but I was still skeptical of the number I was seeing. I wanted to know for sure where that 98% was coming from, and what it actually meant. So I signed up for a personal Windsurf subscription, installed the editor, and ran some tests.

The Math Behind the Curtain

My original plan was to use mitmproxy to watch the outgoing network traffic from the IDE, and see what numbers it was reporting as I took different actions. That turned out to be easier said than done though, because Windsurf is quite chatty on the network, sending many requests to various domains while in use, and even pretty often when I'm not touching it at all.

Screenshot of mitmproxy GUI, showing a series of GET and POST requests to various domains including codeium.com and windsurf.com.

Additionally, Windsurf makes heavy use of protobuf, a data encoding scheme that I'm pretty sure Google invented to annoy me personally, because it makes it much harder to interpret and debug the traffic between clients and servers. If you don't have the associated definition file, a protobuf message is basically just a list of simple values (int32, bytes, etc.) with no human-readable labels. Because of this it was hard for me to tell which messages were related to the PCW metric, or what exactly they were communicating to Windsurf's cloud backend.
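To give a sense of the problem, here's roughly what inspecting an unknown message looks like, using the protobufjs package (this is just to illustrate the difficulty; it has nothing to do with Windsurf's own code):

const protobuf = require('protobufjs');

// Walk an unknown protobuf payload. Without the .proto definitions, all we
// can recover is field numbers and wire types, never field names.
function listFields(buffer) {
  const reader = protobuf.Reader.create(buffer);
  while (reader.pos < reader.len) {
    const key = reader.uint32();   // each value is prefixed with a key varint
    const fieldNumber = key >>> 3; // upper bits: the field number
    const wireType = key & 7;      // lower 3 bits: 0 = varint, 2 = length-delimited, ...
    console.log(`field #${fieldNumber}, wire type ${wireType}`);
    reader.skipType(wireType);     // skip the value; we couldn't label it anyway
  }
}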

Luckily, I found an easier way. It turns out that even though the dashboard says "Analytics update every three hours", it actually shows new data almost instantly. And while the UI only shows the overall percentage, the response from the web server actually includes some additional data. It's protobuf as well, but since it's a webpage the source code is all immediately accessible, and of course the frontend code includes a copy of the message definitions so it can make use of the data.

Screenshot of the Windsurf analytics dashboard, with the Chrome developer tools open. We can see a request called GetAnalytics, but the response is shown in the hex viewer since it's not plain text.

So I was able to decode the GetAnalytics response and pull out these fields (among others):

  • user_bytes
  • codeium_bytes
  • total_bytes
  • percent_code_written

Windsurf used to be called Codeium, so clearly that one represents the AI-generated bytes. And as you'd expect, the percent_code_written is equal to codeium_bytes / total_bytes.
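To double-check the arithmetic, here's a minimal reconstruction. The assumption that total_bytes is simply user_bytes + codeium_bytes is mine, but it's consistent with the percentages reported in the tests below:

// My reconstruction of the dashboard math from the decoded fields.
function percentCodeWritten(user_bytes, codeium_bytes) {
  const total_bytes = user_bytes + codeium_bytes; // assumed, not confirmed
  return 100 * (codeium_bytes / total_bytes);
}

// Using the per-session deltas from the two-file test described below:
console.log(percentCodeWritten(199, 420).toFixed(1)); // "67.9"

So far so good, but what causes those values to change?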

Windsurf says they take measurements "whenever a commit is being made", but that doesn't match my testing. Whether the folder I'm in has a git repo set up or not, as soon as I make additions to a file the user_bytes value increases, and if I delete some of those lines it decreases. Whether I do a commit (using Windsurf's git UI) between those two actions makes no difference as far as I can tell. What does make a difference is restarting the editor; it seems to forget the history of how each line was generated, so deleting code I wrote before the restart doesn't deduct from user_bytes, and deleting code Cascade wrote before the restart doesn't deduct from codeium_bytes. There is a line in the PCW article that alludes to this ("We currently do not have instrumentation to measure PCW across sessions"), but obviously that's a pretty major gap in functionality, and it doesn't actually address why the described git integration appears to be nonexistent.

To test how exactly the byte counts are being computed, I performed a few tests where I took specific actions and checked how much each value had increased. To keep things simple I disabled the AI autocomplete features (which I find more distracting than helpful anyway) and just focused on the Cascade chat experience. I created a file, human_file.js, and I typed out a single line:

console.log('This line was written by a human.');

49 characters exactly. Then I told Cascade to create a second file (ai_file.js) and to write a similar line of the same length.

Screenshot of Windsurf. I've prompted Cascade to "Create a new file, ai_file.js, which contains only the following line: console.log('This line was written by Cascade.');". Cascade has created the file, with the new lines highlighted in green.

The result:

user_bytes: 855 -> 901 (+46)
codeium_bytes: 7387 -> 7437 (+50)

So the system did seem to be working, but we have a discrepancy right off the bat. The line is definitely 49 characters (50 with a newline at the end), so why is user_bytes only reporting 46? Well, this is where some technicalities start to emerge. Windsurf says that they measure "code that can be attributed to the developer manually typing". The Windsurf editor is a lightly modified version of VSCode, and like most code editors, VSCode has a feature that automatically adds closing symbols (end quotes, closing parentheses, etc.) without the user manually typing them. I suspect that because those characters are being added by that feature, they're technically not "the developer manually typing", and therefore are not counted.
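Here's a toy model of that theory. The pairing logic below is my guess at how VSCode's auto-close feature interacts with the counter, not Windsurf's actual accounting:

// Count how many characters the developer "manually typed", treating each
// auto-inserted closing symbol as free. Illustrative only.
const AUTO_CLOSE = { '(': ')', '[': ']', '{': '}', "'": "'", '"': '"' };

function typedBytes(line) {
  let typed = 0;
  const pending = []; // closers the editor will insert for us
  for (const ch of line) {
    if (pending.length > 0 && ch === pending[pending.length - 1]) {
      pending.pop(); // the editor typed this one, not the human
      continue;
    }
    typed += 1;
    if (ch in AUTO_CLOSE) pending.push(AUTO_CLOSE[ch]);
  }
  return typed;
}

const line = "console.log('This line was written by a human.');";
console.log(line.length, typedBytes(line)); // 49 on disk, 47 "typed"

The toy model lands at 47, close to but not exactly the 46 Windsurf recorded, so something else is probably being discounted as well.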

If that's what's going on, then in my opinion that's already a pretty serious knock against the reliability of Windsurf's metrics. Counting closing symbols when the LLM outputs them, but not when VSCode auto-adds them, obviously biases the stats to increase the percentage of code attributed to AI (even if the effect is fairly slight). As it turns out, there may be some not-so-slight biases as well.

Continuing my test, I wrote out a simple function, and asked Cascade to write a similar function in its own file. Finally I copy/pasted Cascade's function into the human file, and asked Cascade to copy my function into its file.

Screenshot of Windsurf, showing a file called ai_file.js. Below the console.log added earlier, two functions are shown, called func_by_cascade() and func_by_a_human(). func_by_a_human() is highlighted in green, having been copied from the other file by Cascade.

Here's the final tally:

user_bytes: 1054 (+199)
codeium_bytes: 7807 (+420)

So for this session, Windsurf is reporting that Cascade generated more than twice as much code as I wrote, even though we each produced an almost identical file. I never touched ai_file.js, Cascade never touched human_file.js, and the two files are the same length (actually human_file.js is 21 bytes longer because Cascade used Unix-style line endings). Yet somehow my PCW for this session would be around 68%. The trick here is that much like with the auto-added closing symbols, it seems like any text the user pastes doesn't count towards user_bytes. I guess from a certain perspective that could sound reasonable (if you pasted code from StackOverflow you didn't really "write" it), but the way it plays out in practice quickly becomes absurd.

In another test I hand-wrote two functions in a single file, then moved them both to a second file (as one might do when refactoring). For the first I cut and pasted, for the second I asked Cascade to move it for me. The result? Cutting the first function deducts it from user_bytes, and pasting it doesn't count for anything. Cascade deleting the second function also deducts it from user_bytes, but the lines added to the new file count towards codeium_bytes. So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.
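As a ledger, that refactoring session looks something like this. The byte values are made up, but the credit/debit rules are exactly what I observed:

// Two hand-written functions, then a refactor that moves both to a new file.
let user_bytes = 120 + 120;  // writing both functions by hand
let codeium_bytes = 0;

// Function 1: cutting debits user_bytes; pasting counts for nothing.
user_bytes -= 120;

// Function 2: Cascade deleting it also debits user_bytes...
user_bytes -= 120;
// ...and Cascade re-inserting it in the new file credits the AI.
codeium_bytes += 120;

const pcw = 100 * codeium_bytes / (user_bytes + codeium_bytes);
console.log(user_bytes, codeium_bytes, pcw); // 0 120 100, i.e. "100% AI"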

In my opinion these biases make Windsurf's PCW metric basically useless. By being so picky about what counts as a human contribution, and being as generous as possible to the LLM, Windsurf (intentionally or accidentally) tips the scales towards reporting absurdly high percentages, regardless of where most of the code is actually coming from or whether it eventually gets committed.

Who Else?

So that seems... bad, but of course Windsurf is just one of many AI-enhanced IDEs out there (and it's owned by Cognition, makers of Devin, who don't have a stellar track record). What about the other products on the market? As far as I can tell Google's Antigravity editor doesn't have any comparable metrics. GitHub Copilot does provide stats on how many lines of code it generated, but not as a percentage of the total. Amazon Kiro is the same. I did find one popular editor with a metric similar to Windsurf's PCW though: Cursor, with its "AI Share of Committed Code". So how does it stack up?

Sadly Cursor only offers analytics on their business-focused "Team" plan, making this one of my costlier blog posts, but I'll do almost anything for science. Right off the bat things are looking better, with a more nuanced and considered description of their measurement approach:

"Cursor keeps a log of the signature of every AI line (Tab or Agent) that is suggested to the user during their chat session. These lines are stored and later compared to the signatures of each line in subsequent git commits that were written by the same author. ... We use the following definitions: Cursor AI: Any line that can be attributed to Cursor Agent or Tab based on diff signatures. Other: Any line of code that can't be detected as being written by Cursor"

So rather than splitting hairs about the various ways a programmer can add text to a file, they simply divide the total lines in a commit into "AI" and "Other".
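Conceptually the scheme is easy to sketch. The trim-and-hash signature below is my assumption; Cursor doesn't publish the exact format:

const crypto = require('crypto');

// Log a signature for every line the AI suggests, then at commit time
// classify each committed line by whether its signature was ever suggested.
const signature = (line) => crypto.createHash('sha256').update(line.trim()).digest('hex');

const aiSignatures = new Set();
const logAcceptedSuggestion = (line) => aiSignatures.add(signature(line));

function classifyCommit(committedLines) {
  let ai = 0, other = 0;
  for (const line of committedLines) {
    if (aiSignatures.has(signature(line))) ai += 1;
    else other += 1;
  }
  return { ai, other, aiShare: ai / (ai + other) };
}

Sounds great, but does it work?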

Well, the git integration certainly does. While Cursor does also use protobuf, it's easy to tell that it's sending an event called "ReportCommitAiAnalyticsRequest" whenever I do a commit, and that message clearly includes information about the different files and what seem to be the line ranges produced by different methods. We can also see the results on the Cursor website, though it takes a while for them to appear. Running my same test from before, we get a much more reasonable result:

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 52.6%. Below the metrics there is a bar graph showing how much code came from different sources.

I'm not sure why the bar graph doesn't go to 100%.

Certainly a lot closer than the 67.9% that Windsurf reported. I'm actually not sure what caused it to report 20 AI lines vs 18 "other" lines; I did the test as several separate commits and the IDE commit history shows the first commit adding 1 line to each file and the second commit adding 20, so that should be a total of 21 for both. I did manage to capture the protobuf message the IDE sent for the second commit, and it seems to be showing (correctly) that lines 3 through 21 of ai_file.js were written by the Composer 2 model, and 3–21 of human_file.js were added manually.

Screenshot of the captured ReportCommitAiAnalyticsRequest message decoded with an online protobuf tool, showing line ranges in ai_file.js attributed to Composer 2 and line ranges in human_file.js attributed to manual edits.

Thanks to pawitp for this handy protobuf decoder tool.

So I'm not sure why a few lines seem to have gone missing, but regardless the behavior does more or less match what I'd expect from Cursor's description.

Unfortunately, the line-based approach has other flaws that don't show up in this test. For instance, I pasted in a (bogus) 100-line JavaScript file, and then told Cursor to change all the double quotes to single quotes (updating escape characters where necessary). Some might argue that that's an overly simple task to delegate to an LLM (as opposed to an IDE or linter feature), but with some companies giving employees basically unlimited token budgets, and the very low cost of some of the cheaper flash/nano models, I don't think it's that unrealistic. As you'd expect, Composer 2 handled it flawlessly, touching 49 of the 93 non-blank lines in the file.
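For the curious, the rewrite itself is entirely mechanical. Here's a rough regex version that handles the simple cases (obviously not what Composer actually ran):

// Convert simple double-quoted strings to single quotes, escaping any
// apostrophes in the string body. A linter rule would handle edge cases better.
const toSingleQuotes = (code) =>
  code.replace(/"([^"\\]*)"/g, (_, body) => `'${body.replace(/'/g, "\\'")}'`);

console.log(toSingleQuotes('console.log("it\'s done");'));
// -> console.log('it\'s done');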

Screenshot of the Cursor IDE. In the chat I've entered the prompt "Update this file to use single quote strings instead of double quotes".

The main difference between Windsurf and Cursor seems to be color saturation.

The gotcha here is probably pretty obvious. I was expecting to say "see, I added this code manually, but now that Cursor has changed the quote marks it counts all the lines containing quotes as AI-generated". That wasn't what actually happened, though.

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 87.0%. Next to the first bar on the bar graph there is a new bar showing that 100 lines of code were added, all of them labeled as AI.

Somehow, Cursor counted the entire file as AI, even though we can see from the diff that it left plenty of the lines unchanged. And remember that the entire file is exactly 100 lines long, including some blank ones, so it's not just a case of excluding lines that are considered too simple to be counted. My best guess is that the system that tracks which lines were added by the AI is designed to work with contiguous blocks of code (like drafting an entire function), and if there are too many gaps in the generation it just gives up and calls the whole thing one AI block.
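If that guess is right, the failure mode is easy to reproduce on paper. The merge threshold below is invented purely for illustration:

// Merge nearby AI-edited line ranges into contiguous blocks. Scattered
// single-line edits collapse into one block spanning nearly the whole file.
function mergeRanges(ranges, maxGap = 3) {
  const sorted = ranges.map((r) => r.slice()).sort((a, b) => a[0] - b[0]);
  const merged = [sorted[0]];
  for (const [start, end] of sorted.slice(1)) {
    const last = merged[merged.length - 1];
    if (start - last[1] <= maxGap) last[1] = Math.max(last[1], end);
    else merged.push([start, end]);
  }
  return merged;
}

// 49 one-line quote edits scattered across a 100-line file:
const edits = Array.from({ length: 49 }, (_, i) => [2 * i + 1, 2 * i + 1]);
console.log(mergeRanges(edits)); // [ [ 1, 97 ] ], one "AI block" covering almost everything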

Regardless, this is another case where the AI tool seems to be claiming credit for 100% of the code produced, even though arguably zero lines of code were actually "AI generated", and many of them weren't touched by the tool whatsoever. It looks like both IDEs sometimes wildly overestimate how much they're being used in a coding session.

Weights and Biases

One takeaway here is that it's just very hard to measure the contribution LLMs make to a codebase. Sometimes the best use cases are inquisitive prompts like "Is there already a different solution to this elsewhere in the codebase?" or "Are there any edge cases this logic doesn't cover?", which don't necessarily produce any code at all. On the flipside, I'm a big believer in a philosophy expressed concisely by Jack Diederich:

"I hate code, and I want as little of it as possible in our product."

Measuring the value of an LLM by the number of bytes or lines it produces has all the same problems as measuring developers that way; adding a lot of code doesn't necessarily mean you're adding a lot of value, and sometimes the hardest and most productive work is cleaning up and simplifying what's already there. Besides, when a developer is making heavy use of tab complete, etc., there's not always a clear-cut answer to "was this line of code written by AI", even if you were looking over their shoulder as they wrote the file. So perhaps it's foolish to expect an algorithmic answer to that question.

Still, it's notable that the bias always seems to be towards reporting a higher AI percentage. Whether that number is truly meaningful or not, "what percent of my team's code did Windsurf write" is a very appealing statistic for a manager or executive. Execs love announcing that 30%, 75%, even 100% of their code is AI-generated. And of course high numbers are great for AI companies, because they underscore the value they bring to software teams and help justify their high subscription costs. But as a developer, skewed metrics can be harmful. If 50% of my team's code is AI-generated, will management expect features to be implemented twice as fast? If 90% is AI, do we even need a team?

Again to their credit, Windsurf does push back on that type of thinking in their blog post:

"Writing code is not the same as software development. This is only capturing some level of acceleration while writing code, and does not capture time taken in architecture, debugging, review, deployment, and a number of other steps."

To be sure, all metrics are only as good as your understanding of their limitations. If everyone internalizes that these percentages should only be used to compare trends over time, with the absolute values being essentially meaningless (and not comparable across tools), then maybe the details of how they're computed don't matter. But a sentence like "98% of our new code was written by Windsurf" creates a gut feeling that's hard to talk yourself out of, even when you know there are caveats. And I wonder if the impact of these stats could go beyond press releases and 🚀-laden Slack posts. Since code is protected by literary copyright, and AI-generated works aren't copyrightable, the legal team might get nervous when they hear that the vast majority of their company's code "can be attributed to AI".

Ultimately, I don't really know what percentage of the code I commit is from an AI model. I don't know what the "correct" way to calculate that would be, or if it's worth calculating at all. I'm confident that these tools save me some amount of time, but I also know it's easy to overestimate how much. What I am certain about is that these vendors have a lot of money riding on whether or not AI is fulfilling its grandiose promises: massively accelerating strong developers and completely replacing weak ones.

Perhaps it is, but I'm not going to trust them to measure it.