cross-posted from: https://lemmy.world/post/25011462
SECTION 1. SHORT TITLE
This Act may be cited as the ‘‘Decoupling America’s Artificial Intelligence Capabilities from China Act of 2025’’.
SEC. 3. PROHIBITIONS ON IMPORT AND EXPORT OF ARTIFICIAL INTELLIGENCE OR GENERATIVE ARTIFICIAL INTELLIGENCE TECHNOLOGY OR INTELLECTUAL PROPERTY
(a) PROHIBITION ON IMPORTATION.—On and after the date that is 180 days after the date of the enactment of this Act, the importation into the United States of artificial intelligence or generative artificial intelligence technology or intellectual property developed or produced in the People’s Republic of China is prohibited.
Currently, China has the best open source models in text, video and music generation.


This literally took one click: https://github.com/deepseek-ai
Stop spreading FUD.
Where’s the training data?
Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.
Open source requires giving whatever digital information is necessary to build a binary.
In this case, the “binary” is the network weights, and “whatever is necessary” includes both the training data and the training code.
DeepSeek is sharing the final model weights and the inference code to run them, but not the training data or the full training pipeline.
In other words: a good amount of open source… with a huge binary blob in the middle.
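To make the “weights are the binary” point concrete, here is a toy sketch (a made-up file format and a one-line “model”, nothing to do with DeepSeek’s actual release): anyone holding the blob can load and run it, but nothing inside the blob lets you regenerate it without the original data and training code.

```python
import struct

# Toy illustration: "open weights" means shipping an opaque blob like this.
# You can *run* the model from the blob, but you cannot *rebuild* it,
# which is exactly the "binary blob in the middle" being described above.

def save_weights(weights, path):
    # Serialize a flat list of floats into an opaque binary file.
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(weights)}f", *weights))

def load_weights(path, n):
    # Anyone can read the blob back...
    with open(path, "rb") as f:
        return list(struct.unpack(f"{n}f", f.read()))

def predict(weights, x):
    # ...and run inference with it. A real network is this idea
    # (weighted sums) at vastly larger scale.
    return sum(w * xi for w, xi in zip(weights, x))

save_weights([0.5, -1.0, 2.0], "model.bin")
w = load_weights("model.bin", 3)
print(predict(w, [1.0, 1.0, 1.0]))  # 1.5
```

The part that is missing from a weights-only release is the script that *produced* `model.bin` in the first place, plus the data it consumed.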
Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it’s excellent to have competition on the market and it feels that the US’ whole “But they’re spying on you and being a national security risk” is a hypocritical outcry when Facebook, OpenAI and the like still exist.
What do you think about Deepseek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies and (hopefully more often than not) correctly guessed answers to questions made by humans.
There are several parts to the “spying” risk:
Sending private data to a third party server for the model to process it… well, you just sent it, game over. Use local models, or machines (hopefully) under your control, or ones you trust (AWS? Azure? GCP?.. maybe).
All LLM models are a black box; the only way to make an educated guess about their risk is to compare the training data and procedure to the evaluation data of the final model. There is still a risk of hallucinations and deception, but it can be quantified to some degree.
DeepSeek uses a “Mixture of Experts” approach to reduce computational load… which is great, as long as you trust the “Experts” it uses. Since the LLM that was released for free is still a black box, and there is no way to verify which “Experts” were used to train it, there is also no way to know whether some of those “Experts” might be trained to behave maliciously under specific conditions. It could just as easily be a Trojan horse, with little chance of being detected until it’s too late.
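For anyone unfamiliar with the term, here is a toy sketch of what Mixture-of-Experts routing means (hypothetical and vastly simpler than DeepSeek’s real architecture — the experts and gate here are made up): a gate scores each expert for a given input, and only the top-scoring expert(s) actually run. The opacity problem above is that, from weights alone, you can’t audit what any individual expert was trained to do.

```python
import math

# Toy Mixture-of-Experts: a gate picks which "expert" handles the input.
# In a real MoE model both the gate and the experts are learned networks
# inside the weight blob, so their behavior can't be inspected directly.

EXPERTS = {
    "math":   lambda x: x * 2.0,
    "poetry": lambda x: x + 0.5,
    "code":   lambda x: x * x,
}

def gate_scores(x):
    # A real gate is learned; this fixed toy function just favors
    # "math" for large inputs and "poetry" for small ones.
    return {"math": x, "poetry": 1.0 - x, "code": x / 2.0}

def moe_forward(x, top_k=1):
    scores = gate_scores(x)
    # Keep only the top-k experts, then softmax-weight their outputs.
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    exps = {name: math.exp(scores[name]) for name in chosen}
    total = sum(exps.values())
    return sum(exps[n] / total * EXPERTS[n](x) for n in chosen)

print(moe_forward(0.9))  # gate routes to the "math" expert -> 1.8
```

The computational win is that only `top_k` experts run per input instead of all of them; the trust question is what each expert would do on inputs you never tested.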
The feedback degradation of an LLM happens when it gets fed its own output as part of the training data. We don’t exactly know what training data was used for DeepSeek, but as long as it was generated by some different LLM, there would be little risk of a feedback reinforcement loop.
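The feedback-degradation effect described above can be shown with a toy simulation (illustrative only — a Gaussian stands in for the model, and numbers are arbitrary): repeatedly “train” on your own output and the diversity of what you produce collapses.

```python
import random
import statistics

# Toy model-collapse simulation: a "model" is just a fitted Gaussian.
# Each generation it is retrained on samples drawn from itself, which
# is the self-feedback loop the comment above warns about.

random.seed(0)  # deterministic run for reproducibility

def train_on(samples):
    # "Training" = fitting mean and stdev to the data.
    return statistics.mean(samples), statistics.pstdev(samples)

mu, sigma = 0.0, 1.0              # generation 0: trained on real data
for generation in range(500):
    output = [random.gauss(mu, sigma) for _ in range(50)]
    mu, sigma = train_on(output)  # next model trains on model output

print(sigma)  # typically far below the original 1.0 -- diversity collapsed
```

Training on a *different* model’s output doesn’t close this loop in the same way, which is why the risk is smaller there — though quality still depends on that other model.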
Generally speaking, I would run the DeepSeek LLM in an isolated environment, but not trust it to be integrated into any sort of non-sandboxed agent. The downloadable smartphone app is possibly “safe”, as long as you restrict the hell out of it, don’t let it access anything on its own, and don’t feed it anything remotely sensitive.
Thank you a lot for the load of information! I just now got to reading it all. I was very skeptical about the fact that it is fed by the output of other LLMs but the way you explain it makes sense to me that it might not be that much of a problem. I guess a super blunt analogy could be “It’s only incest if it’s your children” lol
Nobody releases training data. It’s too large and varied. The best I’ve seen was the LAION-2B set that Stable Diffusion used, and that’s still just a big collection of links. Even that isn’t going to fit on a GitHub repo.
Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.
What about these? Dozens of TB here:
https://huggingface.co/HuggingFaceFW
There is also a LAION-5B now, and several other datasets.
Wow, it’s like you didn’t even read my post.
That’s why it’s not Open Source. They do not release the source, and it’s impossible to build the model from source.
Can you actually explain what in my reply is “Fear, uncertainty, and doubt”? Did you actually read it? I even linked to the specific github repository, which is basically empty. You just link to an overview, which does not point to any source code.
Please explain what’s FUD and link to the source code; otherwise don’t accuse people of spreading FUD if you don’t know what you are talking about.
You’re purposely being obtuse, and not arguing in good faith. The source code is right there, in the other repos owned by the deepseek-ai user.

What are you talking about? What bad faith are you accusing me of? I asked you to show me the repository that contains the source code. There is none. Please give me a link to the repo you have in mind. Where is the source code and training data of DeepSeek-R1? Can we build the model from source?
deleted by creator