Benchmarking AI for Digital Marketing
There are plenty of benchmark setups for LLMs, mostly covering software development, writing, reading comprehension and so on. So I wanted to know whether something similar could be done for digital marketing. Of course, it’s quite difficult to create a question set that covers all of digital marketing, and it would be quite subjective even if I attempted it. So I thought it might be a good idea to use the marketing platforms’ own certification questions instead. It’s important to note that these questions already exist on the internet, so they are potentially included in the models’ training data. In other words, this is not a measure of LLM intelligence; it’s more a test of publicly available knowledge.
Methodology
- I chose the Google Ads Search Certification on Google’s Skillshop platform. I figured this is the most popular product on the most popular digital marketing platform.
- I gathered 93 questions, each with four answer choices and the correct answer. I did not include “select multiple” questions because I was too lazy to write the code for that.
Example question format:
{
  "question": "How many responsive search ads can you have in Google Ads?",
  "options": [
    "There’s a limit of six enabled responsive search ads per ad group. If you have text that should appear in every ad, you must add the text to either Headline position 1, Headline position 2, or Description position 1",
    "There’s a limit of five enabled responsive search ads per ad group. If you have text that should appear in every ad, you must add the text to Headline position 1, Headline position 2, and Description position 1.",
    "There’s a limit of three enabled responsive search ads per ad group. If you have text that should appear in every ad, you must add the text to either Headline position 1, Headline position 2, or Description position 1",
    "There’s no limit of enabled responsive search ads per ad group. If you have text that should appear in every ad, you must add the text to either Headline position 1, Headline position 2, or Description position 1"
  ],
  "correctAnswer": "There’s a limit of three enabled responsive search ads per ad group. If you have text that should appear in every ad, you must add the text to either Headline position 1, Headline position 2, or Description position 1"
}
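For reference, the question objects can be typed roughly like this in TypeScript (the interface just mirrors the JSON above; the questions.json file name is an assumption for illustration):

import { readFileSync } from "node:fs";

// Shape of one benchmark question, mirroring the JSON format above.
interface Question {
  question: string;      // the question text
  options: string[];     // the four answer choices, later presented to the model as A-D
  correctAnswer: string; // the full text of the correct option
}

// Assumption: the 93 questions are stored in a local questions.json file.
const questions: Question[] = JSON.parse(readFileSync("questions.json", "utf8"));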
- The benchmark runs on the models below, asks each model all 93 questions one by one, and then calculates each model’s accuracy percentage (a sketch of this loop follows the model list).
- The benchmark also asks each model to provide reasoning for every answer. I found that this improves scores by around 5% for non-thinking models, basically forcing them to think before answering.
- The prompts are exactly the same for all models.
{ name: "google/gemini-2.5-flash", temperature: 0 },
{ name: "google/gemini-2.5-flash-lite", temperature: 0 },
{ name: "google/gemini-2.0-flash-001", temperature: 0 },
{ name: "google/gemini-2.0-flash-lite-001", temperature: 0 },
{ name: "google/gemini-2.5-pro", temperature: 0 },
{ name: "deepseek/deepseek-chat-v3-0324", temperature: 0 },
{ name: "deepseek/deepseek-r1-0528", temperature: 0 },
{ name: "moonshotai/kimi-k2", temperature: 0 },
{ name: "openai/gpt-4.1", temperature: 0 },
{ name: "openai/gpt-4.1-nano", temperature: 0 },
{ name: "openai/gpt-4o-mini", temperature: 0 },
{ name: "openai/gpt-4o", temperature: 0 },
{ name: "openai/o4-mini", temperature: 0 },
{ name: "mistralai/mistral-small-3.2-24b-instruct", temperature: 0 },
{ name: "qwen/qwen3-235b-a22b", temperature: 0 },
{ name: "qwen/qwen3-32b", temperature: 0 }
I also tested alternative temperature values, but I found temperature 0 to be the best not only for accuracy but also for reproducibility. With temperature 1, some models’ scores fluctuated by around 10-15% from run to run.
You may be wondering why I did not include Claude, one of the frontier models. That’s basically because Claude does not support structured outputs, and this benchmark relies on structured outputs via OpenRouter.
Results
- Model size and performance do not seem to correlate in this benchmark. For instance, mistral-small, a 24B model that can run locally on a strong gaming PC, beats DeepSeek R1.
- Similarly, some older models beat newer ones. For instance, Gemini 2.0 Flash performs better than 2.5 Flash.
- Qwen3 seems to be the best among the open source models.
- DeepSeek R1 is very disappointing, considering what a great model it is in other fields.
- Kimi K2 actually performs much better than it appears here. The problem is instruction following: the benchmark code expects the answer as a single letter such as “A” or “B”, but Kimi K2 insists on answering in the form “B - something something” even when explicitly instructed otherwise (see the parsing sketch after this list). Since instruction following is also an important factor when developing apps with LLMs, I believe it’s fair to list it at 58%.
- Most importantly, almost all of the models actually pass the certification, since the passing score is 80%. This shows that these certifications don’t matter as much as they used to.
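To illustrate what that strictness looks like in practice, here is a hypothetical comparison of a strict check like the one the benchmark effectively applies and a more lenient parser that would accept answers like “B - something something” (neither is the actual benchmark code):

// Illustrative only: strict vs. lenient answer matching.
const VALID_LETTERS = ["A", "B", "C", "D"];

// Strict: the answer must be exactly one letter, e.g. "B"; anything else counts as wrong.
function strictParse(reply: string): string | null {
  const trimmed = reply.trim();
  return VALID_LETTERS.includes(trimmed) ? trimmed : null;
}

// Lenient: also accepts replies like "B - something something" by taking the leading letter.
function lenientParse(reply: string): string | null {
  const match = reply.trim().match(/^([A-D])\b/);
  return match ? match[1] : null;
}

Keeping the strict version is a deliberate choice here, since it also measures instruction following.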
Overall, the results surprised me because they don’t follow the general consensus on model performance or the results of benchmarks in other fields. So I will probably test other question sets as well, maybe Google Analytics and HubSpot.
What about the price?
Well, if we’re talking about LLMs, I believe we should always include pricing as well. That paints a whole different picture.
Yes, Gemini 2.5 Pro absolutely kills it in terms of performance, but it also comes with a crazy price difference. Compared to Gemini 2.5 Flash, it is 40 times more expensive for each correct answer, while going from Flash to Pro only raises accuracy from 89.1% to 96.7%. For an app like this, that is a huge price difference for a modest gain.
However, also note that this is not a price-per-token comparison. The chart above uses the actual cost of running the benchmark. So, for instance, Gemini 2.5 Pro is not only more expensive per token, but it also uses a lot more tokens per question because it’s a thinking model. I believe this is how LLM price comparisons should be made, since it reflects real-life usage more accurately.
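In other words, the metric being compared is simply the measured cost of a full benchmark run divided by the number of questions the model got right, roughly like the sketch below (the function and its usage are hypothetical):

// Cost per correct answer = actual cost of the full benchmark run / number of correct answers.
function costPerCorrectAnswer(runCostUsd: number, accuracyPct: number, totalQuestions = 93): number {
  const correctAnswers = Math.round((accuracyPct / 100) * totalQuestions);
  return runCostUsd / correctAnswers;
}

// Hypothetical usage: plug in each model's measured run cost (which already reflects
// how many tokens it actually consumed), not its per-token list price, e.g.
//   costPerCorrectAnswer(proRunCost, 96.7) vs. costPerCorrectAnswer(flashRunCost, 89.1)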
If you want to run the benchmark yourself, you can find the code on my GitHub. Each run costs around $2.