AI in Fast Food Drive-Thrus
A business case and system design for a rare instance where I think AI is being used correctly.
It’s 1am, halfway through a 12-hour road trip, and you’re starving. You see those sweet Golden Arches and think, “I could go for a Big Mac right now.” You drive up.
“May I take your order?”
You respond instinctively, not realising you just spoke to an AI.
Since 2023, a human attendant no longer answers the call at some Wendy’s locations: Leading Drive-Thru Innovation with Wendy’s FreshAI
With some subpar results: Dear Wendy’s, PLEASE get rid of AI ordering at the drive thru
McDonald’s also tried this in 2021 and dropped it after a couple of botched orders: McMishaps: McDonald’s nixes AI drive-thrus after multiple viral mix-ups
For now, the AI models used at McDonald's do not seem good enough to take an order for a cheeseburger without messing it up.
With a reported accuracy rate of 85%, you cannot completely replace the employee behind the window. Yet.
This kind of problem really excites me. I see a clear, successful use case for AI and related models. The question is whether it is worth doing. I’ll explain how McDonald's drive-thrus work, how they can benefit from a correct use of AI and speech-to-text models, how to fix the current approach, and what to expect next.
But first, let's talk about the why.
WHY
Why would a fast food chain look at replacing at least one full-time employee at each restaurant across the country?
The Franchisee Owner
95% of McDonald's restaurants are owned by independent franchisees.
The cost to a franchisee of staffing one drive-thru position at McDonald's:
Minimum wage: $17.70 CAD/hour
Operating hours: 24 hours a day, 365 days a year
Annual cost: ~$155k CAD per location
There are approximately 1,400 McDonald's restaurants in Canada.
That amounts to ~$217M CAD/year in drive-thru wages alone across all corporate-owned and franchisee-operated McDonald's.
The benefit to the owner is obvious.
It’s basic mathematics.
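To make that arithmetic easy to check, here is a minimal sketch in Python. The inputs are the figures quoted above; the function name is my own, and the model assumes exactly one attendant on shift around the clock with no benefits or overtime.

```python
# Figures quoted in the article; one attendant on shift 24/7 is an assumption.
HOURLY_WAGE_CAD = 17.70
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
LOCATIONS_IN_CANADA = 1400


def annual_staffing_cost(wage: float, hours_per_day: int, days: int) -> float:
    """Wages for one continuously staffed drive-thru position."""
    return wage * hours_per_day * days


per_location = annual_staffing_cost(HOURLY_WAGE_CAD, HOURS_PER_DAY, DAYS_PER_YEAR)
national = per_location * LOCATIONS_IN_CANADA

print(f"Per location:  ${per_location:,.0f} CAD/year")
print(f"All locations: ${national / 1e6:,.1f}M CAD/year")
```

This lands on about $155k per location and about $217M nationally, matching the estimates above.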
McDonald's Corporate Strategy
What was Corporate McDonald's incentive to consider AI in drive-thrus? According to McDonald’s own business model deck for investors, its main revenue comes from rent and royalties: rent paid by franchisees for properties, and royalties based on a percentage of sales.
But an AI Drive Thru Attendant doesn't generate more sales. You can't have more people drive through McDonald's just because you added an AI Drive Thru Attendant. Only one car fits in that lane.
The AI Drive Thru Attendant serves a different purpose:
Eliminating employee training costs for Corporate McDonald's
Standardizing quality for a more consistent value proposition
McDonald's really cares about guest experience, according to their investor deck.
If McDonald's can deliver the same fries, burger, and taste worldwide - from Canada to Saudi Arabia to India to Japan - it also aims to deliver the same experience. The most difficult consistency challenge is human-human interaction. McDonald's has a specific image for each employee it wishes to maintain so guests know what to expect.
The McDonald's app and Kiosk represent the first step toward a consistent universal experience. I can use the same familiar Kiosk in Spain as in Canada. Why not the same drive-thru experience?
The Guest Experience Factor
Why use a drive-thru at all? Today, you can park in a designated spot, order through the mobile app, and wait without risking drive-thru accidents.
My assumption: people still prefer talking to order, don't like using apps, or want to maintain their momentum. I have no data on whether drive-thru usage has decreased since the app was introduced, but I still use the drive-thru when I want to keep moving. It takes effort to get going again after parking.
So drive-thrus remain valid as long as people have vehicles, places to go, and want food quickly. The convenience factor is unparalleled.
But what if you could:
Speak to the box in any language?
Just speak, and it recognizes exactly who you are?
Reduce the risk of harassment and verbal abuse for employees and guests?
An AI Drive Thru Attendant can deliver all this.
If it works properly, guests can speak any language, enjoy voice-activated authentication, and get their order 100% correct every time, with a consistent voice experience at every McDonald's. For corporate, this means near-perfect quality assurance, with a lower risk of employee harassment and lower training costs. For franchisees, it eliminates drive-thru payroll expenses and increases profits.
Building the AI-Driven Drive-Thru
What's needed to build this system?
Our goal: replace all attendant actions with an AI attendant performing a one-to-one match.
The AI attendant must:
Take orders
Respond to guests
Know current menu, promotions, and upselling techniques
Send orders to Kitchen and Payment Processing
Let's make a generalized assumption about the problem scenario. I'm not an expert on McDonald's internal processes, but here's my take on a typical drive-thru interaction:
Guest enters property and approaches drive-thru
Guest enters drive-thru queue
Guest approaches voice box
Attendant begins Order Transaction
"May I take your order?" prompt delivered
Customer responds
Interaction continues with:
Initial Order
Order Clarification
Promotion Prompting
Customization Requests
Order Confirmation
Upsizing Offers
Price Confirmation
Attendant gives Final Confirmation
Order formatted for Kitchen/Payment API
Customer prompted to proceed to payment
Transaction ends
This requires five components:
1. Speech-to-Text Service: for capturing guest voice input
2. Large Language Model: to interpret speech-to-text input, recognize orders, understand context, process order steps, and communicate with McDonald's order creation API
3. Text-to-Speech Service: for responding to guests in the McDonald's voice
4. Order Processing Service: a single event sender called by the LLM to communicate with all relevant systems
5. Data Warehouse: for storing all audio files
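To show how the five components could fit together, here is a toy sketch of one conversation. Everything in it is hypothetical: the function names, the `Order` type, and the keyword matching that stands in for the LLM are placeholders of my own, not McDonald's systems or any vendor's API. A real build would swap each stub for a call to the actual service.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    items: list[str] = field(default_factory=list)
    confirmed: bool = False


def speech_to_text(audio: bytes) -> str:
    """Component 1 (stub): pretend the audio is already UTF-8 text."""
    return audio.decode("utf-8")


def interpret(utterance: str, order: Order, menu: set[str]) -> str:
    """Component 2 (stub for the LLM): naive substring matching on the menu."""
    text = utterance.lower()
    if "that's all" in text:
        order.confirmed = True
        return f"Your order is {', '.join(order.items)}. Please drive forward."
    found = [item for item in sorted(menu) if item in text]
    order.items.extend(found)
    if found:
        return f"Added {', '.join(found)}. Anything else?"
    return "Sorry, I didn't catch that. Could you repeat it?"


def text_to_speech(reply: str) -> bytes:
    """Component 3 (stub): return the reply as bytes instead of audio."""
    return reply.encode("utf-8")


def send_order(order: Order) -> dict:
    """Component 4 (stub): format the single event for kitchen and payment."""
    return {"items": order.items, "status": "sent_to_kitchen"}


def handle_turn(audio: bytes, order: Order, menu: set[str], audio_log: list[bytes]) -> bytes:
    """One guest utterance in, one spoken reply out."""
    audio_log.append(audio)  # Component 5 (stub): keep every recording
    reply = interpret(speech_to_text(audio), order, menu)
    return text_to_speech(reply)


menu = {"big mac", "fries", "coke"}
order = Order()
audio_log: list[bytes] = []

handle_turn(b"Can I get a Big Mac and fries", order, menu, audio_log)
handle_turn(b"That's all", order, menu, audio_log)
payload = send_order(order) if order.confirmed else None
print(payload)
```

The point of the sketch is the shape of the loop: each turn is logged, transcribed, interpreted, and answered, and the order is only sent once the guest confirms it.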
Now let's examine some of these services, their feasibility, accuracy, and assess the full system under ideal conditions.
The Speech-To-Text Model
I'm not certain what technology McDonald's and Wendy's actually use, but some speech-to-text model must be involved. Speech recognition has existed in Computer Science since the 1970s - long before Claude and ChatGPT. The technology has been viable for decades.
Evidence about speech-to-text accuracy is mixed. This 2020 paper argues spontaneous conversation recognition remains challenging.
But a drive-thru interaction is structured and objective-based, which makes it an easier target. We can achieve sub-2% WER (Word Error Rate) with models trained for the domain.
Microsoft's documentation explains that a WER below 5% is considered good. Let’s assume we can use Azure Speech-To-Text Service. Google and AWS have similar ones, and there are smaller companies with other versions too.
In 2025, creating a fast-food-specific speech-to-text model with sub-5% WER is feasible. Let's assume a 5% order error rate due to speech recognition issues.
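For readers who haven't met the metric: WER is usually defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. Here is a small sketch of that calculation using word-level edit distance. The function name and the sample sentences are my own, and production evaluations normally also normalize punctuation and numbers first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


said = "one big mac with no pickles and a large coke"
heard = "one big mac with no pickles and a large coat"
print(f"WER: {word_error_rate(said, heard):.0%}")
```

One wrong word out of ten gives a 10% WER here, which shows why a low WER still matters: the single misheard word was the drink.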
Cost of this Model
The average time from placing an order to receiving it is 151.96 seconds.
Let’s make a very generous assumption that 25% of that time is the conversation.
The ordering time is then ~37.99 seconds.
Speech-to-text is billed per hour of audio.
Microsoft charges $1/hour for real-time transcription.
Approximately 70% of all orders at McDonald's Canada are made at drive-thrus.
According to McDonald’s, they serve 2.5 million people/day. Let’s assume again that each of them made one order.
Rounding the drive-thru share up to 75% to stay on the generous side, this amounts to 1.875 million drive-thru orders/day (a strict 70% would give 1.75 million).
At ~37.99 seconds per drive-thru order, this is 71.2 million seconds/day of audio.
Let’s also assume half of that audio is the guest speaking, which is the part that needs speech-to-text.
That is about 9,888 hours of audio per day, which at $1/hour costs $9,888/day.
Let’s keep it simple and say $10k/day.
This amounts to ~$3.65M/year.
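The chain of assumptions above can be written out as a short script, which makes it easy to change any one input. The constants are the article's figures; the variable names are my own. Note that the 1.875 million orders figure corresponds to a 75% drive-thru share of 2.5 million guests.

```python
# Inputs taken from the article; every one of them is an estimate.
AVG_SERVICE_SECONDS = 151.96   # order placed to order received
CONVERSATION_SHARE = 0.25      # share of that time spent talking
GUESTS_PER_DAY = 2_500_000
DRIVE_THRU_SHARE = 0.75        # implied by the 1.875 million orders/day figure
GUEST_SPEECH_SHARE = 0.5       # half of the audio is the guest speaking
STT_PRICE_PER_HOUR = 1.00      # real-time transcription, dollars per audio hour

orders_per_day = GUESTS_PER_DAY * DRIVE_THRU_SHARE
ordering_seconds = AVG_SERVICE_SECONDS * CONVERSATION_SHARE
audio_seconds_per_day = orders_per_day * ordering_seconds
stt_hours_per_day = audio_seconds_per_day * GUEST_SPEECH_SHARE / 3600

stt_cost_per_day = stt_hours_per_day * STT_PRICE_PER_HOUR
stt_cost_per_year = stt_cost_per_day * 365

print(f"Drive-thru orders/day: {orders_per_day:,.0f}")
print(f"Audio seconds/day:     {audio_seconds_per_day:,.0f}")
print(f"STT cost/day:          ${stt_cost_per_day:,.0f}")
print(f"STT cost/year:         ${stt_cost_per_year / 1e6:.2f}M")
```

Without the intermediate rounding, this comes out at roughly $9.9k/day and $3.6M/year, in line with the rounded $10k/day used above.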
The Text-to-Speech Model
This is 100% achievable in 2025. Numerous companies offer text-to-speech synthesizers with variable speech, tone, and language capabilities. You can even find free options online.
Examples include Azure Text-To-Speech and Google's equivalent on GCP.
Given the straightforward response protocols needed, this presents no significant challenge. We'll assume no issues with speech delivery to guests.
Cost of this Model
The average speaking rate of an English speaker is 150 words per minute.
Thus, roughly 94.5 words are spoken to make an order.
Text to Speech by Microsoft is $15 per 1M characters.
Above, we estimated approximately 1.875 million drive-thru orders/day.
This amounts to 177.18M words across all orders each day.
Half of that is the attendant's response: 88.59M words each day.
Over a year, that is about 32B words.
We assume the average English word is 5 characters long.
That gives about 161B characters/year.
This amounts to ~$2.4M/year.
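Here is the same text-to-speech estimate as a script, using the article's figures as inputs. The variable names are mine, and the 50/50 split between guest and attendant speech is the assumption stated above.

```python
# Inputs taken from the article; every one of them is an estimate.
WORDS_PER_ORDER = 94.5             # whole conversation, guest plus attendant
ORDERS_PER_DAY = 1_875_000
ATTENDANT_SHARE = 0.5              # half of the words are the spoken response
CHARS_PER_WORD = 5
TTS_PRICE_PER_MILLION_CHARS = 15.0

words_per_day = WORDS_PER_ORDER * ORDERS_PER_DAY
attendant_words_per_year = words_per_day * ATTENDANT_SHARE * 365
chars_per_year = attendant_words_per_year * CHARS_PER_WORD
tts_cost_per_year = chars_per_year / 1_000_000 * TTS_PRICE_PER_MILLION_CHARS

print(f"Words/day:       {words_per_day / 1e6:.2f}M")
print(f"Characters/year: {chars_per_year / 1e9:.1f}B")
print(f"TTS cost/year:   ${tts_cost_per_year / 1e6:.2f}M")
```

The character count is the number to watch, because the same figure feeds the LLM token estimate.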
The LLM Model
I propose a custom-trained LLM specifically designed as a McDonald's drive-thru attendant.
Building this is a weak point in my own abilities, but I know what an experienced developer in this speciality can feasibly achieve. These capabilities may also be available out of the box in a few months.
This LLM needs training to:
Access current menu information
Know when to offer "Would you like to make it a large?"
Know when/how to finalize orders for kitchen preparation
Correct potential speech-to-text errors
Apply basic reasoning capabilities
For a reference point, see the current Claude 3 performance metrics.
We don't need advanced coding or math abilities (the payment processor handles calculations). Just high-school-level knowledge and reasoning will suffice.
Let's assume approximately 95% accuracy - a very generous assumption.
Cost of this Model
We can achieve costs as low as $0.05/MTokens with a Llama 3.1 model.
Let’s be generous and say it’s about $1/MTokens.
Per OpenAI's docs, 1 token ≈ 4 characters.
As defined above, we are looking at 161B characters/year.
This amounts to 40.25B tokens/year.
This amounts to $40k/year.
Not bad, if I did my math right.
Note: I did not include the cost of training the model. I assume that is the bulk of the cost, but it is a one-time setup cost. Here I’m looking at regular running costs.
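To check that math, here is the token estimate as a script. It uses the article's 161B characters/year, the rough rule of 4 characters per token, and both the generous $1 and the low-end $0.05 per million tokens. Like the estimate above, it only counts the response text, not prompts or menu context, so treat it as a floor.

```python
# Inputs taken from the article; prompt and context tokens are not counted.
CHARS_PER_YEAR = 161e9
CHARS_PER_TOKEN = 4                    # rough rule of thumb for English text
GENEROUS_PRICE_PER_MTOKENS = 1.00
LOW_PRICE_PER_MTOKENS = 0.05


def llm_cost_per_year(chars: float, price_per_mtokens: float) -> float:
    """Yearly cost in dollars for a given character volume and token price."""
    tokens = chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * price_per_mtokens


tokens_per_year = CHARS_PER_YEAR / CHARS_PER_TOKEN
generous = llm_cost_per_year(CHARS_PER_YEAR, GENEROUS_PRICE_PER_MTOKENS)
low = llm_cost_per_year(CHARS_PER_YEAR, LOW_PRICE_PER_MTOKENS)

print(f"Tokens/year:        {tokens_per_year / 1e9:.2f}B")
print(f"Cost at $1/MTok:    ${generous:,.0f}")
print(f"Cost at $0.05/MTok: ${low:,.0f}")
```

Even at twenty times the cheapest price, inference stays a rounding error next to the speech services.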
Data Warehouse
This is to store all the audio files that are generated.
Above, we estimated 71.2 million seconds/day of audio.
That is about 25.9B seconds/year.
Stored as MP3 at 128 kbps (16 KB per second), this comes to about 414,400 GB/year.
Microsoft offers storage at $0.15 per GB on pay-as-you-go.
This amounts to ~$62k/year.
Somewhat negligible in comparison to the other costs so far.
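The storage figure follows from the bitrate, so here is that conversion spelled out. The inputs are the article's; I use decimal gigabytes (10^9 bytes) and, as in the estimate above, treat the $0.15 as a flat price per GB stored in the year, which is a simplification.

```python
# Inputs taken from the article; pricing is treated as a flat yearly per-GB rate.
AUDIO_SECONDS_PER_DAY = 71.2e6
BITRATE_KBPS = 128                 # MP3 bitrate in kilobits per second
PRICE_PER_GB = 0.15

bytes_per_second = BITRATE_KBPS * 1000 / 8        # 128 kbps = 16,000 bytes/s
seconds_per_year = AUDIO_SECONDS_PER_DAY * 365
gb_per_year = seconds_per_year * bytes_per_second / 1e9
storage_cost_per_year = gb_per_year * PRICE_PER_GB

print(f"Audio/year:   {seconds_per_year / 1e9:.1f}B seconds")
print(f"Storage/year: {gb_per_year:,.0f} GB")
print(f"Cost/year:    ${storage_cost_per_year:,.0f}")
```

Without rounding the seconds down to 25.9B, this gives slightly under 416,000 GB and a little over $62k, so the estimate above holds.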
Final Costs
Speech-To-Text: ~$3.65M/year.
Text-To-Speech: ~$2.4M/year.
LLM Costs: ~$40k/year.
Audio Storage Costs: ~$62k/year.
As a Class 5 estimate, I would start from about $6M/year to operate the basic technology behind it. This base operating cost will most likely decrease over time.
I think the bigger cost would be the development team for this project. McDonald’s doesn’t have to hire IBM or Google to build this out; it could hire a whole in-house development team. Considering the ~$217M CAD/year in drive-thru wages that would be saved across its franchisees, a team of 10 (devs, QA, AI devs, project managers, etc.) could pull this off.
Running a team like this costs roughly $2-5M/year, assuming highly competitive wages. Adding the operating costs above, my estimate of the total technical operating cost is ~$11M/year (-50%/+100%).
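Here is the roll-up as a script, so the totals and the -50%/+100% band are explicit. The line items are the rounded figures from the list above; the dictionary keys and variable names are my own.

```python
# Rounded yearly running costs from the sections above, in dollars.
running_costs = {
    "speech_to_text": 3_650_000,
    "text_to_speech": 2_400_000,
    "llm": 40_000,
    "audio_storage": 62_000,
}
TEAM_COST_LOW = 2_000_000
TEAM_COST_HIGH = 5_000_000

technology_total = sum(running_costs.values())
total_low = technology_total + TEAM_COST_LOW
total_high = technology_total + TEAM_COST_HIGH

# Class 5 style band around the ~$11M point estimate: -50% / +100%.
POINT_ESTIMATE = 11_000_000
band_low = POINT_ESTIMATE * 0.5
band_high = POINT_ESTIMATE * 2.0

print(f"Technology:     ${technology_total / 1e6:.2f}M/year")
print(f"With team:      ${total_low / 1e6:.2f}M to ${total_high / 1e6:.2f}M/year")
print(f"Estimate band:  ${band_low / 1e6:.1f}M to ${band_high / 1e6:.1f}M/year")
```

Even the top of the band, $22M/year, is about a tenth of the ~$217M CAD/year wage bill estimated earlier.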
The Ideal Scenario
Two services will deliver inconsistent results:
LLM model: 95% accuracy
Speech-to-text model: 95% accuracy
Assuming the two error sources are independent, multiplying them (0.95 × 0.95) gives ~90% overall accuracy.
This assumes perfect conditions - no snowstorms disrupting speech recognition, all accents understood, slurred speech interpreted correctly, and handling all voice recognition variables that humans manage effortlessly.
This suggests our current technology stack could theoretically achieve 90% accuracy with off-the-shelf components.
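The combined figure comes from multiplying the stage accuracies, which only holds if the two stages fail independently. Here is a small sketch of that calculation. It also works backwards from the 86% human order accuracy quoted in this article to show how accurate each stage would need to be just to break even. The function name is my own.

```python
import math


def pipeline_accuracy(*stage_accuracies: float) -> float:
    """Probability that every stage is right, assuming independent errors."""
    return math.prod(stage_accuracies)


overall = pipeline_accuracy(0.95, 0.95)

# Accuracy each of two equal stages needs to match the human benchmark.
HUMAN_BASELINE = 0.86
required_per_stage = math.sqrt(HUMAN_BASELINE)

print(f"Overall accuracy:           {overall:.2%}")
print(f"Needed per stage for 86%:   {required_per_stage:.1%}")
```

Two 95% stages give 90.25%, and each stage needs roughly 92.7% to match the human benchmark, so there is not much slack: a few points lost to noise or accents puts the system below it.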
How does this compare to other attempts and the human attempt?
IBM achieved 85% accuracy in 2021 using decision-tree models.
Wendy's reaches 86% accuracy, with claims approaching 90%.
For comparison, human drive-thru attendants currently achieve an average order accuracy of 86%.
Given this comparison point, today you could create this AI Drive Thru Attendant with 90% accuracy. Better than real attendants? Maybe.
I even created a basic proof of concept with Claude: https://claude.ai/share/754e3a44-1bec-4f63-9a1d-5bca0a1180a2
Take completely off-the-shelf technologies, a bit of plumbing, and a whole lot of data, add speech-to-text input and text-to-speech output, and you have a working prototype.
My Final Thoughts and Next Steps
So what's next? We've established theoretical feasibility and strong financial feasibility. How do we implement it?
McDonald's and the other chains can't focus solely on technical execution.
They must consider:
Legal implications
Customer Compliance
Regulatory Requirements
Ethics board approval
Training requirements
Franchisee roll-out plans
Transition strategies
Data storage contracts
Data processing agreements
Many legal challenges exist, but they're solvable. While not my expertise, I could advise different groups on their requirements.
McDonald's isn't abandoning the concept - just IBM. They're quickly seeking another vendor (sounds like it’s Google), clearly recognizing the value and feasibility.
I could elaborate further on these points or provide a cost breakdown for the team and resources needed, but that's material for another article.
There are a few more things I haven’t covered, like a feedback loop mechanism, referral to other orders, and voice recognition for repeat customers. These are all value-adds that can improve the guest experience and the AI solution.
This thought exercise on business-technology intersection has been fascinating. I hope you've learned something about fast food and AI applications. Perhaps it will inspire you to discover your own use case.
Let me know what you think!