The Deep Seek AI bomb finally hit Wall Street yesterday, and now everyone knows who Deep Seek is.
Article overview
Deep Seek’s surprise public debut sent shockwaves through global markets, wiping nearly $600 billion from Nvidia’s value and slashing a cumulative $1.5 trillion from Silicon-Valley darlings in a single day. The author argues the Chinese startup’s rapid rise traces back to stolen OpenAI training data allegedly smuggled by murdered engineer Suchir Balaji and distilled into a lean, high-performing model. Now crowned by Beijing, new CEO Liang Wenfeng embodies the firm’s overnight transformation from obscurity to geopolitical juggernaut.
Silicon Valley ignored repeated warnings—delivered in December 2024 and again on 3 January—that Deep Seek had weaponized Balaji’s drive, even as custodians at Stanford hinted Google had similarly siphoned medical records. Yesterday’s historic sell-off finally forced investors and venture capitalists to confront how the upstart leapfrogged entrenched giants like Meta, Grok, and OpenAI. Lawsuits, finger-pointing, and theories about covert GPU pipelines now dominate tech-sector chatter.
Two narratives duel online: one claims China secretly imported 50,000 state-of-the-art Nvidia H100 chips; the other insists Deep Seek squeezed miracle gains from just 2,000 older Hopper GPUs. Either scenario underscores a security weak point—Taiwan manufactures almost all premium AI chips, sitting only 82 miles from the mainland. Analysts fear that hardware leverage, paired with purloined datasets, could permanently tilt AI supremacy eastward.
Finally, the author revisits knowledge distillation: the process of shrinking a vast “teacher” model into a nimble “student” well suited to tasks like acing the USMLE medical exam. Such distilled systems, he contends, explain Deep Seek’s uncanny scores on physician-licensing tests without granting the model general intelligence. In his seven-part “Deep Seek and Destroy” series, he promises deeper dives into Schmidt, DeepMind, and Kai-Fu Lee’s roles in brokering this east-west data coup.
People and entities
Liang Wenfeng—ex-hedge-fund trader turned Deep Seek CEO—rose from anonymity to shaking hands with China’s premier after the market quake. Suchir Balaji, the slain OpenAI contractor, allegedly spirited out a distilled GPT-4 training drive that fueled Deep Seek’s six-week sprint past Western rivals. Alexandr Wang of Scale AI tries to reassure Davos elites by blaming 50,000 stealth GPUs, while New York Times reporter Cade Metz cites only 2,000 older chips.
Big-tech titans appear as foils: Mark Zuckerberg (Meta), Sam Altman (OpenAI), Elon Musk (first investor in Altman’s AI), and Jeff Bezos now watch their valuations bleed. Hardware linchpin Nvidia suffers record losses, its Taiwan-made GPUs exposing U.S. supply-chain vulnerabilities. Behind the curtain loom power brokers—Eric Schmidt, the DeepMind founders, and Chinese VC legend Kai-Fu Lee—who, the author claims, orchestrate the true AI gold rush now ripping through Wall Street and Washington alike.
Stocks like Nvidia, which makes the graphics processing units (GPUs) for AI, are still sliding on the second day of the Deep Seek hangover. Nvidia lost almost $600 billion in market value in a single day - a Wall Street record.
I have been harping on how Deep Seek stole the training data from OpenAI through the murder of Suchir Balaji since my December 19th, 2024 presentation in Silicon Valley. Now that Deep Seek is an overnight international sensation, perhaps my story will get more traction.
Meanwhile, Deep Seek’s CEO went from unknown to instant celebrity in China, meeting with the Premier and his retinue yesterday.
Liang Wenfeng, a hedge fund trader, became the CEO of Deep Seek AI.
Not far from where I presented the Deep Seek information in Silicon Valley on December 19th sits the headquarters of billionaire Mark Zuckerberg's Meta. Of course, Meta took no notice of the warning.
Just down the street, Sam Altman ran his Y Combinator incubator before he founded OpenAI, and the restaurant where Elon Musk wrote the first check to Sam Altman for AI is just a few more steps down the street from the Rosewood Hotel.
No OpenAI or Grok employees attended the talk, but the Stanford medical records custodian did. He said Google had taken all of Stanford’s medical records without permission in one of their joint ventures in health care management.
I presented the case for Deep Seek’s impact on Silicon Valley again on January 3rd. My thinking was that Silicon Valley knew AI, and its people would want to hear my story about Deep Seek murdering Suchir Balaji for the AI training data it used to leapfrog its competitors.
I even wrote a seven-part series called “Deep Seek And Destroy” on this Substack.
Understanding The Motive For Killing Suchir Balaji
While taking your family vacation this year, you see a road sign that says, “Train Your AI, Only Four Years For Under $100 Million.” You think, “Boy, I would love to create an AI language model, but I just don’t have $100 million in cash or four years to spend training it.”
But nothing got the attention of Silicon Valley like yesterday’s trillion-and-a-half-dollar haircut in market losses, with no end in sight. Now, people want to know how Deep Seek leapfrogged the competition.
https://nypost.com/2025/01/27/business/nvidia-shares-fall-12-as-chinese-ai-startup-deepseek-triggers-panic/?utm_campaign=iphone_nyp&utm_source=twitter&utm_medium=social
Silicon Valley venture capital firms and the financial squawk boxes are issuing recriminations, hurling insults, and threatening lawsuits this morning. Never have the darlings of Silicon Valley fallen so far out of favor in a single day.
Nvidia has now fallen 17%, with no hint yet of any price resistance to prevent a lower plunge. All manner of theories abound, from the Chinese sneaking GPU chips into the country to radical new coding methods that leave American billionaires like Elon Musk, Jeff Bezos, Sam Altman, and Mark Zuckerberg in the dust. On January 3rd, I was very specific about who trained OpenAI’s models alongside Suchir Balaji, and how his training drive did the same for Deep Seek.
https://x.com/RealGeorgeWebb1/status/1875239388906762736
I also introduced you to Alexandr Wang of Scale AI last year, and this year at Davos, Wang is doing his best to calm everyone’s nerves about AI training. Wang claims that the CCP snuck 50,000 Nvidia GPUs into China to train its AI system for Deep Seek.
Interestingly, no one in the press asked Alexandr Wang where these processors were or how Deep Seek got them. The Davos and WEF press just wrote it down: “50,000 secret GPUs in China.” That may explain the Deep Seek leapfrog over the competition, and it may not. That remains to be seen.
By contrast, New York Times AI reporter Cade Metz says Deep Seek used only 2,000 older “Hopper” Nvidia chips to leapfrog all of Silicon Valley’s billionaires and their AI platforms in a single bound.
https://www.nytimes.com/2025/01/23/technology/deepseek-china-ai-chips.html
Whether the CCP got 50,000 of the latest Nvidia H100 GPUs or not, almost all of the premium AI chips are made in Taiwan, just 82 miles from mainland China. That is a glaring national security vulnerability.
https://www.bakerinstitute.org/research/silicon-hegemon-could-china-take-over-taiwans-semiconductor-industry-without-invading
I have covered the key players from Google, like Eric Schmidt and the founders of DeepMind, who are certain to be at the center of the dealmaking with Kai-Fu Lee in China.
Return here to the Paid section to find out the details. We have said Suchir’s training drive was used to train Deep Seek for six weeks. Now the experts are agreeing.
This morning, two AI camps are igniting the internet with their versions of the truth. One is the 50,000-GPU camp, which claims that China has 50,000 of Nvidia’s latest and greatest H100 GPUs hidden in a data center.
The other camp claims only two thousand older Hopper GPUs were used, and that the Chinese engineers were GPU magicians who squeezed ten times the performance out of their chips.
In my seven-part series, I have talked about the murder of Suchir Balaji, the AI whistleblower responsible for the AI training of OpenAI’s GPT-4, and how a “distilled model” could have been extracted onto a removable hard drive from ChatGPT’s petabyte-scale training feed.
For some reason, the public just won’t accept the explanation that the Chinese Communist Party would kill one twenty-six-year-old man for a trillion dollars’ worth of AI training material, especially if that material was medical records. Deep Seek scores very high on the USMLE, the exam that certifies doctors, indicating Deep Seek Release 1 had special training on medical records.
Special-purpose language models built through a process called distillation can surprisingly outperform most humans on certain tests, but that does not mean the model has general intelligence.
For instance, just the datasets needed for scoring well on the USMLE could have been downloaded, resulting in a much smaller and much faster model.
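To make that idea concrete, here is a minimal, hypothetical Python sketch of the first step: filtering a large text corpus down to a medical-domain subset before any distillation is run. The corpus, keyword list, and threshold are illustrative assumptions, not anything from Deep Seek's actual pipeline.

```python
# Hypothetical sketch: keep only medical-domain documents from a larger
# corpus so a small "student" model can later be distilled on a narrow task.
# The keyword list and threshold are illustrative assumptions.

MEDICAL_KEYWORDS = {"diagnosis", "patient", "dosage", "symptom", "pathology"}

def medical_hits(text: str) -> int:
    """Count how many medical keywords occur in the text (substring match)."""
    lowered = text.lower()
    return sum(1 for kw in MEDICAL_KEYWORDS if kw in lowered)

def build_domain_subset(corpus: list[str], min_hits: int = 2) -> list[str]:
    """Keep only documents that look like medical text."""
    return [doc for doc in corpus if medical_hits(doc) >= min_hits]

corpus = [
    "The patient presented with symptoms suggesting a cardiac pathology.",
    "Quarterly GPU shipments rose sharply on AI demand.",
    "Correct dosage depends on the diagnosis and patient weight.",
]
print(build_domain_subset(corpus))  # keeps the two medical documents
```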
I will let ChatGPT explain how this is done.
What process is used to make a smaller scaled-down language model from a much larger one?
ChatGPT
One of the most common approaches for creating a smaller “scaled-down” version of a large language model is knowledge distillation (often just called distillation). In a nutshell, knowledge distillation uses the large model (the “teacher”) to guide the training of a more compact model (the “student”).
Below is an overview of the major steps and relevant techniques that can be used:
1. Knowledge Distillation
Teacher Model Inference
You begin with the large, pre-trained model (the “teacher”).
You feed unlabeled (or labeled) data through this teacher model to capture its outputs (either the logits, the raw scores that get turned into probability distributions over tokens, or its hidden representations).
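As a rough illustration of that step, here is a minimal PyTorch sketch with a toy "teacher" network; in practice the teacher would be a full language model, but the idea of capturing its soft output distributions is the same. The model sizes and data are made-up placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000   # toy vocabulary size (placeholder)
DIM   = 64     # toy hidden size (placeholder)

# Stand-in for a large pre-trained "teacher": embeds a token id and
# predicts a distribution over the next token.
teacher = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
teacher.eval()  # inference only; weights are assumed pre-trained

tokens = torch.randint(0, VOCAB, (8,))                # a batch of 8 token ids
with torch.no_grad():                                 # no gradients for the teacher
    teacher_logits = teacher(tokens)                  # shape: (8, VOCAB)
    soft_targets = F.softmax(teacher_logits, dim=-1)  # the "soft" outputs

print(soft_targets.shape)  # torch.Size([8, 1000])
```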
Student Model Training
Instead of training the smaller model (the “student”) from scratch directly on labeled data, you train it to match the “soft” outputs from the teacher model.
This “soft” target distribution encodes the teacher’s learned representations. By mimicking the teacher’s output probabilities, the student learns not only correct answers but also how the teacher distributes probability among possible alternatives—this often provides more nuanced guidance than hard labels alone.
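Sketched in the same toy PyTorch setting (the architectures and data are placeholders, not Deep Seek's), the student is trained so that its output distribution matches the teacher's soft targets, with no human labels involved at all:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM_T, DIM_S = 1000, 64, 16   # toy sizes; the student is much smaller

teacher = nn.Sequential(nn.Embedding(VOCAB, DIM_T), nn.Linear(DIM_T, VOCAB))
student = nn.Sequential(nn.Embedding(VOCAB, DIM_S), nn.Linear(DIM_S, VOCAB))
teacher.eval()  # assumed pre-trained; only the student is updated

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

tokens = torch.randint(0, VOCAB, (8,))   # an unlabeled batch of token ids
with torch.no_grad():
    soft_targets = F.softmax(teacher(tokens), dim=-1)  # teacher's "soft" outputs

# One training step: push the student's distribution toward the teacher's
# (pure distillation, no ground-truth labels needed).
optimizer.zero_grad()
student_log_probs = F.log_softmax(student(tokens), dim=-1)
loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
loss.backward()
optimizer.step()
```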
Loss Function
The student’s training loss often includes both:
Distillation loss: a measure (e.g., KL divergence) of the difference between the student’s and teacher’s probability distributions.
Standard supervised loss: a measure (e.g., cross-entropy) on the original labeled data (if available).
Balancing these components in the final loss encourages the student to learn both from the teacher’s knowledge and from direct supervision on ground truth labels.
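Put as code, a commonly used combined objective looks roughly like the hedged sketch below: `alpha` weights the distillation term against the ordinary cross-entropy term, and `T` is the softening temperature described in the next step. The weighting, temperature, and random tensors are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Weighted sum of a KL distillation term and a standard cross-entropy term.

    student_logits, teacher_logits: (batch, vocab) raw scores
    labels: (batch,) ground-truth token/class ids
    alpha: weight on the distillation term; T: softening temperature
    """
    # Distillation term: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard supervised term on the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kl + (1.0 - alpha) * ce

# Toy example with random tensors (placeholders, not real model outputs).
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```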
Temperature
In the distillation process, it’s common to apply a “temperature” parameter T to the teacher model’s logits, making the probability distribution less “peaked.” This helps the student learn a broader range of knowledge rather than just the single highest-probability answer.
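A small numeric illustration of what the temperature does to a distribution (the logits here are arbitrary example values):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 0.5])   # arbitrary example logits

for T in (1.0, 4.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")

# At T=1 the distribution is sharply peaked on the first option;
# at T=4 it flattens, exposing the teacher's relative preferences
# among the lower-ranked alternatives as well.
```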
Pros of Knowledge Distillation
Performance: The student model can inherit much of the performance of the teacher without needing the same large scale.
Flexibility: Works even when you only have the teacher model and unlabeled text (labeling not strictly required if you use the teacher’s predictions).
Efficiency: Distilled models can be faster, use less memory, and still retain strong performance.