Understanding DeepSeek-R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).

Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials

The DeepSeek-R1 paper introduced several models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt, avoiding the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
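As a rough illustration, here is how the reasoning block can be separated from the final answer. The `<think>`/`</think>` delimiters follow the R1 paper's convention; the raw output shown is a made-up example, not an actual model response.

```python
import re

# A made-up example of an R1-style output: reasoning inside <think> tags,
# followed by a final summary.
raw_output = """<think>
The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
</think>
17 * 24 = 408."""

# Split the chain-of-thought from the final answer.
match = re.match(r"<think>(.*?)</think>\s*(.*)", raw_output, re.DOTALL)
reasoning, answer = match.group(1).strip(), match.group(2).strip()
print("Reasoning:", reasoning)
print("Answer:", answer)
```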
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is interesting that some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they produced such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual approach:

The usual training strategy: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: pretrained → RL
R1: pretrained → multistage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This produces a good model to start RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved on to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
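To keep the stages straight, here is the same pipeline condensed into a small data structure. It is only a structured restatement of the stages above, with my own field names, not DeepSeek's configuration format.

```python
# The four R1 training stages, condensed. Field names are my own.
R1_PIPELINE = [
    {"stage": "cold-start SFT",
     "starts_from": "DeepSeek-V3-Base",
     "data": "a few thousand curated Chain-of-Thought samples"},
    {"stage": "reasoning RL (GRPO)",
     "rewards": ["accuracy", "formatting (think tags)"]},
    {"stage": "second SFT",
     "starts_from": "DeepSeek-V3-Base",
     "data": "~600k rejection-sampled reasoning samples + ~200k general samples"},
    {"stage": "final RL (GRPO)",
     "rewards": ["reasoning rewards", "helpfulness", "harmlessness"]},
]

for step in R1_PIPELINE:
    print(step)
```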
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.

Model distillation is a method where you use a teacher model to improve a student model by generating training data for the student. The teacher is usually a larger model than the student.
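A minimal sketch of the data-generation half of distillation, using the Hugging Face `transformers` pipeline. The model name and prompt are placeholders for illustration; DeepSeek used R1 itself as the teacher and fine-tuned Qwen/Llama students on roughly 800k of its samples.

```python
from transformers import pipeline

# The teacher generates completions that later serve as SFT targets for a
# smaller student model. The model id below is just a stand-in for "a teacher".
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

prompts = ["Prove that the sum of two even numbers is even."]
distill_dataset = []
for prompt in prompts:
    completion = teacher(prompt, max_new_tokens=512,
                         return_full_text=False)[0]["generated_text"]
    distill_dataset.append({"prompt": prompt, "completion": completion})

# distill_dataset is then used for plain supervised fine-tuning of the student.
```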
Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to produce chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
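A toy version of such a rule-based reward, to make the idea concrete. The specific rules and weights here are my own illustration, not DeepSeek's; real setups verify math answers exactly or run unit tests for code rather than doing a substring check.

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + formatting + language consistency."""
    reward = 0.0

    # Format reward: reasoning should be wrapped in think tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Accuracy reward: the final answer (text after the think block) should
    # contain the reference answer.
    final_answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    if reference_answer.strip() in final_answer:
        reward += 1.0

    # Language-consistency reward (crude proxy): an ASCII-only (e.g. English)
    # prompt should not get an answer that mixes in non-ASCII scripts.
    if prompt.isascii() and not final_answer.isascii():
        reward -= 0.5

    return reward
```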
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its original behavior.
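Here is a numeric sketch of steps 3 and 4 for a single prompt, with made-up rewards and probability ratios. In practice the clipped objective is applied per token and a KL penalty toward a reference policy is added; this only shows the group-relative advantage and the clipping.

```python
import numpy as np

rewards = np.array([1.5, 0.5, 1.0, 0.0])  # step 2: one scalar reward per sampled response

# Step 3: group-relative advantages (rewards standardized within the group).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Step 4: PPO-style clipped surrogate on the probability ratios
# new_policy(response) / old_policy(response). Ratios here are made up.
ratio = np.array([1.10, 0.95, 1.30, 0.80])
eps = 0.2
clipped = np.clip(ratio, 1 - eps, 1 + eps)
objective = np.minimum(ratio * advantages, clipped * advantages).mean()

print("advantages:", advantages)
print("clipped surrogate objective:", objective)
```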
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the `<think>` syntax, to guide the training.
While DeepSeek used GRPO, you could use alternative approaches instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
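For orientation, here is roughly what GRPO training looks like with TRL's `GRPOTrainer`. This is a sketch adapted in spirit from the TRL documentation; the model, dataset, and toy length-based reward are placeholders, and the exact API may have changed, so check the TRL docs before relying on it.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward function: prefer completions close to 50 characters.
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # small model, just for illustration
    reward_funcs=reward_len,           # rule-based reward, no learned reward model
    args=GRPOConfig(output_dir="grpo-demo", per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```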
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video:

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than to an enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce significant performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
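For reference, a minimal way to load a setup like this through the llama-cpp-python bindings. The GGUF path is a placeholder, and the 4-bit KV-cache is configured via llama.cpp's cache-type options, which I omit here; treat this as a sketch rather than the exact invocation.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth quant
    n_gpu_layers=29,                         # partial offload: 29 layers on the GPU
    n_ctx=4096,
)

out = llm("Explain what makes DeepSeek-R1 a reasoning model.", max_tokens=256)
print(out["choices"][0]["text"])
```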
29 layers seemed to be the sweet spot given this configuration.

Performance:
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1, running via Ollama:
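Invoking it looks roughly like this with the Ollama Python client (the model tag and prompt are just examples, and `ollama pull deepseek-r1:70b` has to have completed first):

```python
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",  # 4-bit (Q4_K_M) 70B distill tag in the Ollama library
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response["message"]["content"])
```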
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.

Resources
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandmother - YouTube
DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
- An OpenAI researcher confirms that the DeepSeek team independently discovered and used some of the core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.