Impact Crypto News

NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

by IMPACTCRYPTO
May 7, 2025
in Blockchain




Joerg Hiller
May 07, 2025 15:38

NVIDIA has integrated the pipeline behind Nemotron-CC, its trillion-token-scale dataset for large language models, into NeMo Curator. The pipeline is designed to balance data quality against data quantity when building training sets.




NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, giving developers a way to curate high-quality datasets for large language models (LLMs). The Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl, and NVIDIA says it is intended to significantly improve LLM accuracy.

Advancements in Data Curation

The Nemotron-CC pipeline addresses a limitation of traditional data curation methods, whose heuristic filters often discard potentially useful data. By combining classifier ensembling with synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data and recovers up to 90% of the content lost to filtering.

Innovative Pipeline Features

The pipeline’s data curation process begins with HTML-to-text extraction using tools such as jusText, followed by language identification with FastText. It then deduplicates the data, using NVIDIA RAPIDS libraries for efficient processing. Finally, 28 heuristic filters enforce data quality, and a PerplexityFilter module refines the result further.
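
To make the deduplication and heuristic-filtering stages concrete, here is a minimal pure-Python sketch. It is not the NeMo Curator API: the function names, the two example filters, and their thresholds are all hypothetical stand-ins, and the real pipeline relies on jusText, FastText, and NVIDIA RAPIDS rather than the standard library.

```python
import hashlib
from typing import Callable, Iterable

# A filter is any function that returns True when a document should be kept.
DocFilter = Callable[[str], bool]


def exact_dedup(docs: Iterable[str]) -> list[str]:
    """Keep the first copy of each document.

    Documents are compared after lowercasing and collapsing whitespace,
    which is an assumption of this sketch, not a documented pipeline rule.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def min_words(n: int) -> DocFilter:
    """Hypothetical heuristic: keep documents with at least n words."""
    return lambda doc: len(doc.split()) >= n


def max_symbol_ratio(limit: float) -> DocFilter:
    """Hypothetical heuristic: drop documents dominated by symbols."""
    def check(doc: str) -> bool:
        if not doc:
            return False
        symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
        return symbols / len(doc) <= limit
    return check


def curate(docs: Iterable[str], filters: list[DocFilter]) -> list[str]:
    """Deduplicate first, then keep only documents that pass every filter."""
    return [d for d in exact_dedup(docs) if all(f(d) for f in filters)]


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog.",  # near-verbatim duplicate
        "Too short.",
        "#### $$$$ %%%% ^^^^ &&&& **** !!!!",
    ]
    # Thresholds are illustrative only.
    print(curate(sample, [min_words(5), max_symbol_ratio(0.3)]))
```

Running deduplication before filtering means each filter sees every distinct document only once. A production pipeline would typically add fuzzy deduplication and many more filters than the two shown here.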

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
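
The quality-labeling idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two toy scorers stand in for trained classifiers, combining scores by taking the maximum is one possible ensembling rule, and the three level names are hypothetical.

```python
from typing import Callable, Sequence

# A scorer maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def length_scorer(doc: str) -> float:
    """Toy stand-in for a trained classifier: longer documents score higher."""
    return min(len(doc.split()) / 50.0, 1.0)


def vocabulary_scorer(doc: str) -> float:
    """Toy stand-in for a trained classifier: share of distinct words."""
    words = doc.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(doc: str, scorers: Sequence[Scorer]) -> float:
    """Combine scorers by taking the maximum (an assumed ensembling rule)."""
    return max(scorer(doc) for scorer in scorers)


def quality_bucket(score: float, levels: Sequence[str] = ("low", "medium", "high")) -> str:
    """Map a score in [0, 1] onto equal-width quality levels."""
    clamped = min(max(score, 0.0), 1.0)
    index = min(int(clamped * len(levels)), len(levels) - 1)
    return levels[index]


def label(doc: str, scorers: Sequence[Scorer]) -> str:
    """Assign a quality level to a document from the ensembled score."""
    return quality_bucket(ensemble_score(doc, scorers))


if __name__ == "__main__":
    scorers = [length_scorer, vocabulary_scorer]
    for text in ["a a a a", "alpha beta gamma delta"]:
        print(text, "->", label(text, scorers))
```

With labels like these, a downstream step could route each quality level to a different synthetic data task, for example rephrasing lower-quality text and generating QA pairs from higher-quality text.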

Impact on LLM Training

Training LLMs on the Nemotron-CC dataset yields significant improvements. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC scored 5.6 points higher on MMLU than models trained on traditional datasets. Furthermore, models trained over a long token horizon on data that included Nemotron-CC saw a 5-point boost in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available to developers who want to pretrain foundation models or perform domain-adaptive pretraining in a specific field. NVIDIA provides a step-by-step tutorial and APIs for customization, so users can tune the pipeline to their needs. Because the pipeline is integrated into NeMo Curator, the same tooling can be used to build both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.




