
AI model developers—those who create neural networks to power AI features—are a different breed. They think in terms of latent spaces, embeddings, and loss functions. Their tools of the trade are Python, NumPy, and AI frameworks, and the fruit of their efforts is operation graphs capable of learning how to transform an input into an insight.
A typical AI developer can spend months, if not years, without ever considering how memory is allocated, whether a loop fits in a cache line, or indeed whether there are loops at all. Such concerns are the domain of software engineers and kernel developers. AI developers generally don't think about memory footprints, execution times, or energy consumption. Instead, they focus, correctly, on one main goal: ensuring the AI model accurately derives the desired insights from the available data.
This division of labor functions well in the cloud AI space, where machine learning and inference utilize the same frameworks, hardware, storage, and tools. If an AI developer can run one instance of their model, scaling it to millions of instances becomes a matter of MLOps (and money, of course).
Firmware in edge AI
In the edge AI domain, especially in the embedded AI space, AI developers have no such luxury. Edge AI models are highly constrained by memory, latency, and power. If a cloud AI developer runs up against these constraints, it’s a matter of cost: they can always throw more servers into the pool. In edge AI, these constraints are existential. If the model doesn’t meet them, it isn’t viable.

Figure 1 Edge AI developers must be keenly aware of firmware-related constraints such as memory space and CPU cycles. Source: Ambiq
Edge AI developers must, therefore, be firmware-adjacent: keenly aware of how much memory their model needs, how many CPU cycles it consumes, how quickly it must produce a result, and how much energy it draws. Such questions are usually the domain of firmware engineers, who are known to argue over mega-cycles-per-second (MCPS) budgets, tightly coupled memory (TCM) allocations, and milliwatts of battery drain.
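The budgets firmware engineers argue over reduce to simple arithmetic: a clock rate and an inference cadence fix the cycles available per inference, and average power times latency fixes the energy cost. A minimal sketch, with entirely hypothetical numbers chosen for illustration:

```python
# Back-of-the-envelope budget checks for an embedded AI model.
# All figures (192 MHz clock, 10 inferences/s, 5 mW, 40 ms) are
# hypothetical examples, not measurements of any specific device.

def cycle_budget(clock_mhz: float, inferences_per_sec: float) -> float:
    """Cycles available per inference at a given clock and cadence."""
    return clock_mhz * 1e6 / inferences_per_sec

def energy_per_inference_mj(avg_power_mw: float, latency_ms: float) -> float:
    """Energy per inference in millijoules: power (mW) times time (s)."""
    return avg_power_mw * latency_ms / 1000.0

# A 192 MHz MCU running 10 inferences per second has a budget of
# 19.2 million cycles per inference; a model that averages 5 mW for
# 40 ms costs 0.2 mJ per inference.
budget_cycles = cycle_budget(192, 10)
energy_mj = energy_per_inference_mj(5.0, 40.0)
```

A model whose measured cycle count or energy exceeds these budgets is, in the existential sense described above, simply not viable on that device.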
For the AI developer, figuring out the answer to these questions isn’t a simple process; they must convert their Python-based TensorFlow (or PyTorch) model into firmware, flash it onto an embedded device, and then measure its latency, memory requirements, CPU usage, and energy consumption. With this often-overwhelming amount of data, they then modify their model and try again.
Since much of this process requires firmware expertise, the development cycle usually involves the firmware team, with work tossed back and forth over the fence, and all of that makes for slow iteration.
In tech, slow iteration is a bad thing.
Edge AI development tools
Fortunately, all these steps can be automated. With the right tools, a candidate model can be converted into firmware, flashed onto a development board, profiled and characterized, and the results analyzed in a matter of minutes, all while reducing or eliminating the need to involve the firmware folks.
Take the case of Ambiq's neuralSPOT AutoDeploy. The tool takes a TensorFlow Lite model, a widely used standard format for embedded AI, converts it into firmware, fine-tunes that firmware, and thoroughly characterizes its performance on real hardware, down to the microscopic detail an AI developer finds useful. It compares the output of the firmware model against the Python implementation, and measures latency and power for a variety of AI runtime engines. All automatically, and all in the time it takes to fetch a cup of coffee.
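The comparison step, checking that the quantized firmware model still agrees with the Python reference, can be sketched in a few lines. This is an illustrative stand-in, not AutoDeploy's actual implementation; the function name and tolerance are assumptions made for the example:

```python
# Minimal sketch of an on-device vs. reference output check.
# A quantized model's outputs should track the float reference
# to within some tolerance; abs_tol=0.02 is an arbitrary example.

def outputs_match(reference, on_device, abs_tol=0.02):
    """True if every device output is within abs_tol of the reference."""
    if len(reference) != len(on_device):
        return False
    return all(abs(r - d) <= abs_tol for r, d in zip(reference, on_device))

ref = [0.10, -0.42, 0.88]   # float model output (made-up values)
dev = [0.11, -0.43, 0.87]   # device output; small quantization error expected
assert outputs_match(ref, dev)
```

A real tool would run this over many input vectors and report per-output error statistics rather than a single pass/fail, but the principle is the same: catch numerical divergence before it becomes a field bug.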

Figure 2 AutoDeploy speeds up the AI/embedded iteration cycle by automating most of the tedious bits. Source: Ambiq
By dramatically shortening the optimization loop, such tools accelerate AI development. Less time is spent on the mechanics, and more time can be spent getting the model right: making it faster, smaller, and more efficient.
A recent experience highlights how effective this can be: one of our AI developers was working on a speech synthesis model. The results sounded natural and pleasing, and the model ran smoothly on a laptop. However, when the developer used AutoDeploy to profile the model, he discovered it took two minutes to synthesize just 3 seconds of speech—so slow that he initially thought the model had crashed.
A quick look at the profile data showed that nearly all of that time was spent in just two operations (transpose convolutions) out of the 60 or so operations the model used. These two operations had no kernels optimized for the 16-bit integer numeric format the model required, so they fell back to a slower, reference version of the code.
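Finding a hotspot like this is just a sort over per-operation timings. A minimal sketch of the analysis, with made-up operation names and timings standing in for real profile output:

```python
# Illustrative hotspot analysis over per-operation profile data.
# Operation names and microsecond timings are invented for the
# example; a real profile would come from on-device measurement.

profile_us = {
    "TRANSPOSE_CONV_1": 58_000_000,   # unoptimized reference kernel
    "TRANSPOSE_CONV_2": 60_000_000,   # unoptimized reference kernel
    "CONV_2D":             900_000,
    "FULLY_CONNECTED":     150_000,
    "LOGISTIC":             40_000,
}

total = sum(profile_us.values())
hotspots = sorted(profile_us.items(), key=lambda kv: kv[1], reverse=True)

# Print each operation's share of total inference time, largest first.
for op, us in hotspots:
    print(f"{op}: {us / total:.1%}")
```

With numbers shaped like these, the top two operations account for the overwhelming majority of inference time, which is exactly the pattern that points a developer at a missing optimized kernel rather than at the model architecture as a whole.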
The AI developer had two options: either avoid using those operations or optimize the kernel. Ultimately, he opted for both; he rewrote the kernel to use other equivalent operations and asked Ambiq’s kernel team to create an optimized kernel for future runs. All of this was accomplished in about an hour, instead of the week it would normally take.
Edge AI, especially embedded AI, faces its own unique challenges. Bridging the gap between AI developers and firmware engineers is one of those challenges, but it’s a vital one. Here, edge AI system-on-chip (SoC) providers play an essential role by developing tools that connect these two worlds for their customers and partners—making AI development smooth and effortless.
Scott Hanson, founder and CTO of Ambiq, is an expert in ultra-low energy and variation-tolerant circuits.
The post Bridging the gap: Being an AI developer in a firmware world appeared first on EDN.