The Illusion of Productivity with LLMs
LLMs are great for narrow tasks, but struggle with systems
Every couple of months I attempt to prompt a subsystem for Strata into existence. This is different from having it do small tasks: API lookups, project planning, and other pointed code development (single methods or classes). For those, it’s been a game changer. A subsystem, on the other hand, is an entire feature set built from the ground up. I’ve been running this experiment since April 2025. Now, a full eight months later, I was finally able to merge code from Claude for a somewhat substantial module. Encouraged by that, over the next two days I attempted a slightly more complex module. And like every previous attempt, this one was a failure. All of that work was discarded. Instant regret! For startups, time is critical, and wasting two days of work feels awful.
This has been an ongoing challenge for me: two steps forward and two or more steps backward. By hand, I was able to build the same complex module with more feature completeness, fewer lines of code, and better performance in less than a day.
All of the productivity gains feel like an illusion. I can’t tell if I’m net more productive, neutral, or negative. I’m leaning towards neutral to negative.
Maybe I’m wrong and I’m having a perception issue. I could be subconsciously biased against AIs coming to take our jobs or whatever. You be the judge.
Background
We are building a semantically enabled Business Intelligence application. Our primary focus is enabling non-technical domain experts (think business analysts, operations folks, finance analysts, etc.) to maximally self-serve data. I first built such a system at Netflix, where it was unbelievably successful for this use case. In fact, over my decades in the industry, that was my third attempt at solving this specific problem.
At Strata, we are taking the learnings from that experience and building an even more amazing BI platform. Our tech stack is Ruby on Rails for both frontend and backend.
This being the fourth time I’m building such an application, and having recently built it at Netflix, I have a significant advantage. In fact, in terms of building semantic engines, I might be one of a handful of experts in the world.
Now you know what we are building, my advantage, and our tech stack. Let’s take a look at how I’m using AI.
Day to Day with LLMs
Every day while working on Strata, I use both Grok and Claude. I find Claude more capable for coding tasks in Ruby; Grok, on the other hand, is better at deep research on specific topics. I go back and forth between the two, but in general, 80% of the time I choose code output from Claude over Grok. On the backend I use Claude Code, and my editor setup is NVIM.
These are all the successful ways I’m using LLMs:
API Lookup / Search. I rarely have to double-check the original documentation, although every now and then it still makes up things that don’t exist. This basically replaces search. Rating: 9/10
Sample Code Gen. For example, sample code for manipulating a complex multidimensional array in a specific way. I then take the code and modify it for my needs. In the past, I might have browsed Stack Overflow for examples. So far, only occasionally does it surprise me with a better implementation than I would have come up with, and I still have to modify it quite a bit for my specific needs. Rating: 7/10
Translation. Translating an existing implementation for other use cases. For example, I have a YAML config system that defines Strata behavior for each target database type. LLMs can easily translate these YAML files for all our target databases with pretty high accuracy. The only issue is that you need to double-check the output, because there are subtle mistakes. Rating: 7.5/10
Refactoring. Assisting with refactoring is huge. It works really well when you need to rename classes or relocate them, and it’s pretty good at finding all the references even when variable names are not obviously related. It’s basically a supercharged search and replace. Rating: 8/10
Code Review. For somewhat complex classes and methods, I’ll have it review my work for obvious improvements. It’s somewhat mediocre here, but a good sounding board. Rating: 5/10
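The translation use case above is the one where the “subtle mistakes” bite hardest, since a dropped or renamed key in one database’s YAML config can fail silently. A cheap guard is to diff the key structure of a translated config against a reference. The sketch below is purely illustrative — the file contents and key layout are my invention, not Strata’s actual config format:

```ruby
require "yaml"

# Recursively collect every nested key path in a config hash,
# e.g. {"functions" => {"now" => ...}} yields "functions.now".
def key_paths(node, prefix = [])
  return [prefix.join(".")] unless node.is_a?(Hash)
  node.flat_map { |k, v| key_paths(v, prefix + [k.to_s]) }
end

# Report key paths present in the reference config but missing from
# the translated one, and keys the translation invented.
def config_drift(reference_yaml, translated_yaml)
  ref = key_paths(YAML.safe_load(reference_yaml))
  tra = key_paths(YAML.safe_load(translated_yaml))
  { missing: ref - tra, extra: tra - ref }
end

# Hypothetical per-database configs (invented for illustration).
postgres = <<~YAML
  quoting: '"'
  functions:
    date_trunc: DATE_TRUNC
    now: NOW
YAML

# An LLM translation that silently dropped one key and renamed another.
snowflake = <<~YAML
  quoting: '"'
  functions:
    date_trunc: DATE_TRUNC
    current_ts: CURRENT_TIMESTAMP
YAML

drift = config_drift(postgres, snowflake)
puts drift[:missing].inspect  # => ["functions.now"]
puts drift[:extra].inspect    # => ["functions.current_ts"]
```

This doesn’t catch wrong values, only structural drift, but it turns the double-checking step into a mechanical diff rather than a line-by-line read.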
On a day to day usage for these small tasks it looks like I’m getting a huge productivity boost. How about for a moderate subsystem?
Formatting Module
Over Christmas break, I spent about a day and a half building a new formatting system with Claude. In Strata, data can be formatted based on data types, rules at the semantic layer, or rules at the reporting layer. We need to support number and date formatting, in addition to advanced formatting via HTML templates and JavaScript functions.
Additionally, the formatting module has to apply formatting rules during backend processing, export formatting rules for client-side formatting, export Excel- and Google Sheets-compatible formatting, and parse formatting rules into appropriate JSON structures.
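The dual role described above — apply a rule on the backend and export the same rule for the client — can be sketched roughly like this. The rule names and structure are my assumptions for illustration, not Strata’s actual module (which leans on Rails helpers; this sketch uses only the Ruby stdlib to stay self-contained):

```ruby
require "json"
require "date"

# A formatting rule that both formats values server-side and
# serializes itself as JSON so the client (or an Excel/Sheets
# export) can apply the equivalent rule.
FormatRule = Struct.new(:kind, :options) do
  # Apply the rule during backend processing.
  def apply(value)
    case kind
    when :number then format("%.#{options.fetch(:precision, 2)}f", value)
    when :date   then value.strftime(options.fetch(:pattern, "%Y-%m-%d"))
    else value.to_s
    end
  end

  # Export the rule for client-side formatting.
  def to_json_rule
    JSON.generate(kind: kind, options: options)
  end
end

revenue = FormatRule.new(:number, { precision: 1 })
closed  = FormatRule.new(:date, { pattern: "%b %d, %Y" })

puts revenue.apply(1234.567)               # => "1234.6"
puts closed.apply(Date.new(2025, 12, 25))  # => "Dec 25, 2025"
puts revenue.to_json_rule                  # => {"kind":"number","options":{"precision":1}}
```

The design point is that a rule is data, not code: the same JSON structure drives backend rendering, client rendering, and spreadsheet exports.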
This is not a super complicated project, but it is substantially more than writing a single code snippet. I wish I had tracked the exact amount of time I spent on it. This was over the holiday break, and I was squeezing in work while the kids were sleeping or away playing. I’m estimating about a day and a half to maybe two days.
It required two complete builds before getting close to correct. Iteration one was way over the top: it didn’t use any of the out-of-the-box formatting features that Rails provides. Claude built everything from scratch for some reason. I tried to prompt it back to a simpler solution, but the code just got messier.
Having learned how Claude responded to the first prompt, I created a second prompt (these are detailed markup docs, not one-liners) with specific instructions on simplicity and on using Rails’ existing capabilities. This worked pretty well. I probably refactored about 20% of the code for correctness, clarity, and organization. This is the largest chunk of code I have committed from AI so far, and the only module/feature-level code I’ve accepted from an LLM.
Am I truly satisfied with this implementation? I’d say no. I probably have at least one more refactoring to do for correctness, mostly around the use of locales. And to get the right locale, other parts of the system need to be modified. Claude never considered locale at all.
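To make the locale gap concrete: number formatting isn’t universal, because locales disagree on separators, so any formatter that hard-codes `.` and `,` is subtly wrong outside English. A minimal sketch of what locale-awareness means here (the helper and separator table are hypothetical, not Strata’s code; Rails’ own number helpers handle this via I18n, but only if the caller threads the locale through):

```ruby
# Decimal and thousands separators per locale; German swaps the two.
SEPARATORS = {
  "en" => { decimal: ".", thousands: "," },
  "de" => { decimal: ",", thousands: "." },
}.freeze

# Hypothetical locale-aware number formatter (non-negative values).
def localized_number(value, locale: "en")
  sep = SEPARATORS.fetch(locale)
  whole, frac = format("%.2f", value).split(".")
  # Group the integer part into thousands, then join with locale separators.
  grouped = whole.reverse.scan(/\d{1,3}/).join(sep[:thousands]).reverse
  "#{grouped}#{sep[:decimal]}#{frac}"
end

puts localized_number(1234567.89)                # => "1,234,567.89"
puts localized_number(1234567.89, locale: "de")  # => "1.234.567,89"
```

The refactoring debt is exactly this: every call site that formats a number or date needs a locale flowing into it, which is a cross-cutting change, not a local fix.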
Feels like success on the surface. But, I can see the tech debt. Debt that will come due soon.
Are my feelings here a perception problem? I feel like I could’ve built a more complete system in the same time, or with an extra day or two. Maybe I’m overestimating my capabilities. Let’s explore my next LLM project, which was substantially more complicated and which I ended up rewriting. It may answer the questions posed here.
Chart Rendering Module
Strata is a Business Intelligence tool. Of course, we have to take raw datasets and transform them into beautiful visualizations. Our target users just need basic chart types: line, bar, area, donut, stacked area, and stacked bar. Maybe a few more. We already have systems to generate queries, execute them against a data warehouse, cache results, and reuse cached results. We also have a beautiful editor where users can configure how they want the chart to render (x, y, and series axes, plus tooltips and a row axis).
Now we just need to transform the data to be rendered by Chart.js. This transformation requires understanding how to leverage the series axis when it’s a dimension vs. when it’s a measure. It has to understand how to handle multiple measures on the y-axis and what to do when dual axes are enabled. Finally, it needs to create chart multiples when a row axis is configured. Much of the work is transforming data into a Chart.js-compatible JSON object. As a bonus, we also need to support color themes for rendering many series.
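The core of the series-as-dimension case is a pivot: flat result rows become one Chart.js dataset per series value, all aligned to a shared label axis. A hedged sketch under invented field names (“month”, “region”, “revenue” are illustration only, and this skips the measure-as-series, dual-axis, and row-axis-multiples cases Strata also handles):

```ruby
require "json"

# Pivot flat rows into a Chart.js-compatible labels/datasets structure,
# one dataset per distinct value of the series dimension.
def to_chartjs(rows, x:, series:, y:)
  labels = rows.map { |r| r[x] }.uniq
  datasets = rows.group_by { |r| r[series] }.map do |name, group|
    by_x = group.to_h { |r| [r[x], r[y]] }
    # Align every dataset to the shared x-axis; nil where a point is missing.
    { label: name, data: labels.map { |l| by_x[l] } }
  end
  { labels: labels, datasets: datasets }
end

rows = [
  { "month" => "Jan", "region" => "US", "revenue" => 120 },
  { "month" => "Feb", "region" => "US", "revenue" => 140 },
  { "month" => "Jan", "region" => "EU", "revenue" => 90 },
]

config = to_chartjs(rows, x: "month", series: "region", y: "revenue")
puts JSON.pretty_generate(config)
# labels: ["Jan", "Feb"]; US => [120, 140]; EU => [90, nil]
```

The resulting hash serializes directly into the `data` option of a Chart.js chart; the sparse-alignment step (nil-filling missing points) is what keeps multiple series from drifting off the shared axis.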
I made three attempts at building this from scratch, each time refining the detailed spec describing everything from code style to requirements. On the third attempt, I’d estimate we reached 70% feature completeness but only 50% correctness. Complete means it responds to a given scenario with output that looks right; correctness is how robust the feature is against bugs. This is my qualitative assessment, so take it with a grain of salt. Neither correctness nor completeness is the reason this attempt failed. In fact, the chart rendered beautifully with color themes and formatting, and with some more prompting and refining I was able to improve both. However, I did note that the lines of code and structure seemed more complex than what I would do. But I let it proceed, as I was intent on getting to success here. Only to end up scrapping the whole thing.
So what happened?
Well, in certain situations the chart was taking over one second to render, even with fewer than 100 data points. Most of the second day was spent prompting our way toward diagnosing the issue. The one second to render was purely client-side time; it had nothing to do with the server response (although that wasn’t amazing either). In the end, neither I nor Claude could figure out why this was happening. Hours of prompting and no luck. I know the Chart.js library well, and in no way should it take over a second to render.
Instead of deleting all of Claude’s work as I normally do in this situation, I decided to create a fresh branch off master and hand-roll the whole module. Good for comparison. In one day I was able to get to 50% feature completeness and 90% correctness. (Correctness is subjective; there may be bugs I have not found or thought to look for.) I probably wrote less than 20% of the lines of code Claude produced, and I only needed two new files, whereas Claude created six new files and far more lines of code. But the biggest win: only 14ms to render 1,000+ data points with Chart.js, without any special performance optimization. I still have no idea why the Claude implementation had a problem here.
I lost two days of productivity here. But was my hand-rolled version faster to build because of those two days of exploration? That’s what makes it hard to tell. AIs are becoming part of our habits, which makes it hard to tease out the gains.
Summary
My ongoing experiments with full module implementations are erasing some of the gains I’m making on the day-to-day use cases listed above. If I stuck to those simple tasks, I would say I’m 3x more productive. But letting AI take on bigger tasks is setting me back days. I don’t have an intuitive feeling of being more productive, which is why I’m leaning toward a neutral-to-negative productivity gain.
Maybe I should stop these experiments. But I can’t, nor can you. There is too much influential hype about what others are able to achieve. This leads to a nagging feeling that you’re not prompting right, that it’s your fault, or that the new version is substantially better and can finally get the job done.
Or, it’s all an illusion. We have a talking machine that tells you things with great confidence. That kind of confidence plus a sense of authority is persuasive.


