31 January 2025 / News

DeepSeek R1: Decoding Its Significance

Jaywing

DeepSeek hit the news this week with the launch of its new large language model, R1. Positioned as an open-source alternative to existing AI models, R1 performs well according to early indications, with the larger versions outperforming GPT-4o and o1-mini on benchmark tests.

So, does this indicate that China is catching up with the US and becoming a major rival in the world of generative AI, or should we be viewing it differently?

Our Accelerator Lab leader Malcolm Clifford and Head of AI Research & Development Joe Crawforth tell you what you need to know. 

DeepSeek's Rise: From Code to Conversation 

DeepSeek may not have been in the public eye much until the last week or so, but its research has been on the radar for some time. Its first release back in 2023, DeepSeek-Coder, was a series of open-source code-generation models built to support developers, and at the time it was considered the best-performing model available for the task.

The Chinese government has placed great emphasis on AI development, encouraging Chinese tech companies to take on the US technology giants. In response, the US government has banned the export of high-end chips to China since 2022 in order to make it harder for Chinese companies to compete. 

Not only has the release of the larger R1 model had an impact on the AI landscape, DeepSeek also released a series of six smaller, distilled models based on two previous open-source model families from Meta and Alibaba (Llama 3 and Qwen 2.5). DeepSeek fine-tuned these models using R1 as the teacher in a teacher-student method, producing powerful “reasoning” models that can run on consumer-grade hardware. This makes powerful AI far more efficient and accessible: users can run and build on these models locally, and once a model is downloaded, neither the user nor the model needs to touch the internet, keeping chats, data and work safely within the local environment.
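As a rough illustration of what running one of these distilled models locally can look like, here is a minimal sketch using the Hugging Face transformers library. The model ID follows DeepSeek’s published naming but is an assumption on our part; check the model hub for the exact variant and its hardware requirements.

```python
# Minimal local-inference sketch: assumes transformers, torch and accelerate
# are installed and the machine has enough memory for a ~7B-parameter model.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID -- verify the exact name on the Hugging Face hub.
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Once the weights are cached locally, no internet access is required:
# prompts, outputs and data all stay on the local machine.
prompt = "Explain, step by step, how compound interest works."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Tools such as Ollama or llama.cpp offer a similar fully offline workflow, with lower memory requirements via quantised weights.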

Performance vs. Hype: Digging Deeper into the Benchmarks 

It’s true to say that in terms of capability, R1 is an attempt to rival the latest OpenAI models, and the fact that it gets close on the benchmarks is an achievement in itself. There is scepticism across the AI community, however, about whether its real performance in the hands of users reflects its benchmark results. More recent trials by users and researchers in the open world have shown that real performance is far from that of OpenAI’s o1 and actually much closer to OpenAI’s o1-mini, with prompt sensitivity being one of the main areas of concern. Prompt sensitivity is where tweaks to instructions that say or mean essentially the same thing in plain English give massively variable outcomes. As for the benchmarks themselves, there have been studies into whether large language models can genuinely reason or primarily depend on token bias (bias learned towards the use of specific tokens). As Bowen Jiang et al. put it:

“A strong token bias suggests that the model is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task.” [1]

This suggests that using variations of the wording from the original benchmark questions significantly changes the quality of the model’s responses. That in turn strongly suggests the questions themselves were included in the training data: rather than reasoning its way to the answer, the model has memorised the answer from training. Ignacio de Gregorio Noblejas, a prominent voice in the AI community, has recently taken to LinkedIn to discuss whether the benchmarks are a good indication of a model’s “smartness.” [2] He suggests that the current benchmarks may be misleading, focusing more on surface-level accuracy than on deeper cognitive capabilities. He goes on to reference work by Jenia Jitsev and colleagues, who have tested R1 and other models using the Alice in Wonderland (AIW) methodology [3], scoring how language models perform on seemingly simple tasks that are easy for a human. This shows R1 is extremely sensitive to token bias: it performs closer to the level of OpenAI’s o1-mini when variations are introduced to the benchmark questions (Fig. 1) and more akin to Anthropic’s Claude 3.5 Sonnet when the complexity of the variations is increased further (Fig. 2).

Fig. 1: R1 performing closer to OpenAI’s o1-mini once variations are introduced to the benchmark questions (AIW).
Fig. 2: R1 performing closer to Anthropic’s Claude 3.5 Sonnet as the complexity of the variations increases (AIW).
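This kind of prompt-sensitivity check is straightforward to reproduce in miniature. The sketch below is an illustration only: it assumes an OpenAI-compatible Python client and uses a placeholder model name, asking several paraphrases of the same AIW-style question and comparing the answers. Divergence across paraphrases is a rough signal of the token bias described above.

```python
# Prompt-sensitivity probe: semantically equivalent questions should get
# the same answer from a model that genuinely reasons.
# Client, endpoint and model name are placeholders, not DeepSeek's API.
from openai import OpenAI

client = OpenAI()  # assumes an API key or OpenAI-compatible local endpoint

# Paraphrases of one AIW-style question; the correct answer is 3
# (Alice's brother has Alice's 2 sisters plus Alice herself).
variants = [
    "Alice has 3 brothers and 2 sisters. How many sisters does Alice's brother have?",
    "Alice has two sisters as well as three brothers. How many sisters does a brother of Alice have?",
    "A girl called Alice has 3 brothers; she also has 2 sisters. How many sisters does one of her brothers have?",
]

for question in variants:
    response = client.chat.completions.create(
        model="model-under-test",  # placeholder -- swap in the model to probe
        messages=[{"role": "user", "content": question}],
    )
    print(f"Q: {question}\nA: {response.choices[0].message.content.strip()}\n")
```

A model that reasons rather than pattern-matches should converge on the same answer regardless of wording; the AIW study’s finding is that many state-of-the-art models do not [3].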

Another metric of performance that holds high value in the AI community is Chatbot Arena, a platform that lets real users test models side by side on everyday tasks and vote for the one they prefer [4]. At the time of writing, it holds Google’s current experimental thinking model, Gemini-2.0-Flash-Thinking-Exp-01-21, in the number 1 position.

That being said, the other models in the tests shown are closed source, so the information available on them beyond benchmarks and usage is limited, and they are not available for the research community to develop further outside the ringfences of the large tech companies. DeepSeek R1 is currently the most powerful open-source model available.

Cost Efficiency: The Game Changer? 

The real innovation here is the claim that the model cost only $6m (£4.8m) to train. Whilst OpenAI and Google have never published a cost, estimates are that the training cost of GPT-4 exceeded $100m. If accurate, this challenges the assumption that only the most advanced chips can produce the highest-performing models, and stocks like Nvidia’s suffered accordingly. Multiple organisations are seeking to validate DeepSeek’s claims, so the story may run for a while yet. In the past, efficiency has come with trade-offs in capability, so the extent to which this has been mitigated has yet to be fully established.

Security and Factual Accuracy: Proceed with Caution 

By climbing so overtly into the spotlight, DeepSeek has also come under increased scrutiny. New York-based cyber security firm Wiz recently discovered a major security vulnerability in DeepSeek’s database infrastructure, which had been left exposed without proper access controls. This serves as a stark reminder that data security remains a crucial concern in AI development.

Many organisations use AI chatbots informally via browser access, often by employees who haven’t been trained in best practice. Chatbots are not a suitable place to upload private or sensitive data, yet many people do so without so much as using the options provided to opt out of their data being used for model training. Even major AI providers aren’t immune to data risks, and we fear a certain inevitability that some companies will be found in breach of data regulations due to this kind of usage.

The other cautionary tale is a reminder not to rely on model output where factual accuracy is important. All language models can output incorrect facts (known as hallucinations). A censorship layer that prevents the model from responding to any political matters the Chinese government deems ‘sensitive’ is easy to anticipate, but the factual accuracy of models and the opacity of where judgements and opinions actually come from should be a concern for everyone.

What Does This Mean for Marketers? 

For marketers, DeepSeek R1 raises several important points: 

  • More Affordable AI Solutions: If DeepSeek’s efficiency claims hold true, businesses may gain access to powerful AI models at a fraction of current costs. This could lower barriers to AI adoption in industries like marketing, content creation, and customer engagement. 
  • Multilingual AI Competition: While Western AI models focus primarily on English, DeepSeek R1 has been optimised for Chinese-language processing. This could make it a compelling option for companies targeting global, multilingual audiences. 
  • Data Security Considerations: As AI adoption grows, so do concerns around data security. Brands integrating AI into their marketing stacks must be mindful of privacy risks and ensure compliance with data protection laws. 
  • Prompt Engineering: R1’s prompt sensitivity underscores the importance of marketers being skilled in prompt engineering to maximise AI’s potential. 
  • Content Accuracy: Like all large language models, R1 can “hallucinate”, generating factually incorrect, biased or misleading information. Any content the model produces needs careful review to ensure the output aligns with your brand’s views and facts.

Competition and Innovation

As for whether this represents a shift in power towards Chinese AI, we have our doubts. The US tech giants – Google, OpenAI, Meta and Anthropic – continue to lead in cutting-edge AI research, backed by huge resources and the brightest and best talent. DeepSeek’s model is impressive, but perhaps more as a cost-effective alternative than an outright breakthrough.

But China will certainly keep pushing. The Chinese government has made AI a national priority and this will only serve to intensify competition. DeepSeek R1 is definitely a shot across the bows but the response of Western companies is likely to be to accelerate their own innovation cycles. R1 is a significant development, but not a game changer – yet.

References

[1] Jiang, B., et al. (2024). A peek into token bias: Large language models are not yet genuine reasoners (arXiv:2406.11050). https://arxiv.org/abs/2406.11050

[2] De Gregorio Noblejas, I. (2025, January 30). Discussion on DeepSeek and OpenAI ChatGPT. LinkedIn. https://www.linkedin.com/posts/ignacio-de-gregorio-noblejas_deepseek-openai-chatgpt-activity-7290022510752944128-BvAK 

[3] Nezhurina, M., Cipolina-Kun, L., Cherti, M., & Jitsev, J. (2024). Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models (arXiv:2406.02061v4). https://arxiv.org/abs/2406.02061v4 

[4] Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., & Stoica, I. (2024). Chatbot Arena: An open platform for evaluating LLMs by human preference (arXiv:2403.04132 [cs.AI]). https://arxiv.org/abs/2403.04132