cross-posted from: https://lemmy.world/post/25011462
SECTION 1. SHORT TITLE
This Act may be cited as the ‘‘Decoupling America’s Artificial Intelligence Capabilities from China Act of 2025’’.
SEC. 3. PROHIBITIONS ON IMPORT AND EXPORT OF ARTIFICIAL INTELLIGENCE OR GENERATIVE ARTIFICIAL INTELLIGENCE TECHNOLOGY OR INTELLECTUAL PROPERTY
(a) PROHIBITION ON IMPORTATION.—On and after the date that is 180 days after the date of the enactment of this Act, the importation into the United States of artificial intelligence or generative artificial intelligence technology or intellectual property developed or produced in the People’s Republic of China is prohibited.
Currently, China has the best open source models in text, video and music generation.


This literally took one click: https://github.com/deepseek-ai
Stop spreading FUD.
Where’s the training data?
Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.
Open source requires giving whatever digital information is necessary to build a binary.
In this case, the “binary” is the network weights, and “whatever is necessary” includes both the training data and the training code.
DeepSeek is sharing the final model weights and the inference code to run them, but not the training data or the full training pipeline.
In other words: a good amount of open source… with a huge binary blob in the middle.
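To make the “weights are the binary” point concrete, here is a toy sketch (a made-up file format and a one-line “model”, nothing to do with DeepSeek’s actual release): anyone holding the blob can load and run it, but nothing inside the blob lets you regenerate it without the original data and training code.

```python
import struct

# Toy illustration: "open weights" means shipping an opaque blob like this.
# You can *run* the model from the blob, but you cannot *rebuild* it,
# which is exactly the "binary blob in the middle" being described above.

def save_weights(weights, path):
    # Serialize a flat list of floats into an opaque binary file.
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(weights)}f", *weights))

def load_weights(path, n):
    # Anyone can read the blob back...
    with open(path, "rb") as f:
        return list(struct.unpack(f"{n}f", f.read()))

def predict(weights, x):
    # ...and run inference with it. A real network is this idea
    # (weighted sums) at vastly larger scale.
    return sum(w * xi for w, xi in zip(weights, x))

save_weights([0.5, -1.0, 2.0], "model.bin")
w = load_weights("model.bin", 3)
print(predict(w, [1.0, 1.0, 1.0]))  # 1.5
```

The part that is missing from a weights-only release is the script that *produced* `model.bin` in the first place, plus the data it consumed.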
Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it’s excellent to have competition on the market and it feels that the US’ whole “But they’re spying on you and being a national security risk” is a hypocritical outcry when Facebook, OpenAI and the like still exist.
What do you think about Deepseek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies and (hopefully more often than not) correctly guessed answers to questions made by humans.
There are several parts to the “spying” risk:
Sending private data to a third party server for the model to process it… well, you just sent it, game over. Use local models, or machines (hopefully) under your control, or ones you trust (AWS? Azure? GCP?.. maybe).
All LLM models are a black box; the only way to make an educated guess about their risk is to compare the training data and procedure to the evaluation data of the final model. There is still a risk of hallucinations and deception, but it can be quantified to some degree.
DeepSeek uses a “Mixture of Experts” approach to reduce computational load… which is great, as long as you trust the “Experts” it uses. Since the LLM that was released for free is still a black box, and there is no way to verify which “Experts” were used to train it, there is also no way to know whether some of those “Experts” might be trained to behave maliciously under specific conditions. It could just as easily be a Trojan horse, with little chance of being detected until it’s too late.
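For anyone unfamiliar with the term, here is a toy sketch of what Mixture-of-Experts routing means (hypothetical and vastly simpler than DeepSeek’s real architecture — the experts and gate here are made up): a gate scores each expert for a given input, and only the top-scoring expert(s) actually run. The opacity problem above is that, from weights alone, you can’t audit what any individual expert was trained to do.

```python
import math

# Toy Mixture-of-Experts: a gate picks which "expert" handles the input.
# In a real MoE model both the gate and the experts are learned networks
# inside the weight blob, so their behavior can't be inspected directly.

EXPERTS = {
    "math":   lambda x: x * 2.0,
    "poetry": lambda x: x + 0.5,
    "code":   lambda x: x * x,
}

def gate_scores(x):
    # A real gate is learned; this fixed toy function just favors
    # "math" for large inputs and "poetry" for small ones.
    return {"math": x, "poetry": 1.0 - x, "code": x / 2.0}

def moe_forward(x, top_k=1):
    scores = gate_scores(x)
    # Keep only the top-k experts, then softmax-weight their outputs.
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    exps = {name: math.exp(scores[name]) for name in chosen}
    total = sum(exps.values())
    return sum(exps[n] / total * EXPERTS[n](x) for n in chosen)

print(moe_forward(0.9))  # gate routes to the "math" expert -> 1.8
```

The computational win is that only `top_k` experts run per input instead of all of them; the trust question is what each expert would do on inputs you never tested.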
The feedback degradation of an LLM happens when it gets fed its own output as part of the training data. We don’t exactly know what training data was used for DeepSeek, but as long as it was generated by some different LLM, there would be little risk of a feedback reinforcement loop.
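The feedback-degradation effect described above can be shown with a toy simulation (illustrative only — a Gaussian stands in for the model, and numbers are arbitrary): repeatedly “train” on your own output and the diversity of what you produce collapses.

```python
import random
import statistics

# Toy model-collapse simulation: a "model" is just a fitted Gaussian.
# Each generation it is retrained on samples drawn from itself, which
# is the self-feedback loop the comment above warns about.

random.seed(0)  # deterministic run for reproducibility

def train_on(samples):
    # "Training" = fitting mean and stdev to the data.
    return statistics.mean(samples), statistics.pstdev(samples)

mu, sigma = 0.0, 1.0              # generation 0: trained on real data
for generation in range(500):
    output = [random.gauss(mu, sigma) for _ in range(50)]
    mu, sigma = train_on(output)  # next model trains on model output

print(sigma)  # typically far below the original 1.0 -- diversity collapsed
```

Training on a *different* model’s output doesn’t close this loop in the same way, which is why the risk is smaller there — though quality still depends on that other model.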
Generally speaking, I would run the DeepSeek LLM in an isolated environment, but not trust it to be integrated into any sort of non-sandboxed agent. The downloadable smartphone app is possibly “safe”, as long as you restrict the hell out of it, don’t let it access anything on its own, and don’t feed it anything remotely sensitive.
Thank you a lot for the load of information! I just now got to reading it all. I was very skeptical about the fact that it is fed by the output of other LLMs but the way you explain it makes sense to me that it might not be that much of a problem. I guess a super blunt analogy could be “It’s only incest if it’s your children” lol
Nobody releases training data. It’s too large and varied. The best I’ve seen was the LAION-2B set that Stable Diffusion used, and that’s still just a big collection of links. Even that isn’t going to fit on a GitHub repo.
Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.
What about these? Dozens of TB here:
https://huggingface.co/HuggingFaceFW
There is also a LAION-5B now, and several other datasets.
Wow, it’s like you didn’t even read my post.
That’s why it’s not Open Source. They do not release the source, and it’s impossible to build the model from source.
Can you actually explain what in my reply is “Fear, uncertainty, and doubt”? Did you actually read it? I even linked to the specific github repository, which is basically empty. You just link to an overview, which does not point to any source code.
Please explain what’s FUD and link to the source code; otherwise don’t accuse people of spreading FUD if you don’t know what you are talking about.
You’re purposely being obtuse, and not arguing in good faith. The source code is right there, in the other repos owned by the deepseek-ai user.

What are you talking about? What bad faith are you accusing me of? I asked you to show me the repository that contains the source code. There is none. Please give me a link to the repo you have in mind. Where is the source code and training data of DeepSeek-R1? Can we build the model from source?
deleted by creator