Meta’s new Megabyte system solves one of the biggest roadblocks for GPTs

May 27, 2023

Meta AI recently published pre-print research showing off a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.

Dubbed “promising” by OpenAI’s Andrej Karpathy, former director of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data — such as images, novels and video files — without the use of a process known as tokenization.

Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.

Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7

— Andrej Karpathy (@karpathy) May 15, 2023

Tokenization is a preprocessing step that’s comparable to file compression. To process large amounts of data, GPT models convert bytes to tokens, each of which typically stands for several bytes. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded.

The tokenization process allows an AI system to process larger strings of data as numbers. The words “my favorite color is red,” if processed by OpenAI’s ChatGPT, for example, would be converted to the token string “3666, 4004, 3124, 318, 2266, 13” for processing.

*OpenAI demonstration of tokenization process. Source: OpenAI*
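To make the contrast concrete, here is a minimal Python sketch of what a byte-level model would see for the same sentence. The token IDs are the ones quoted above; the trailing period in the input text is an assumption made here to account for the sixth token, and `to_byte_values` is a hypothetical helper name, not part of any OpenAI or Meta tooling.

```python
# Token IDs quoted in the article for the example sentence.
ARTICLE_TOKEN_IDS = [3666, 4004, 3124, 318, 2266, 13]

# Assumption: the tokenized input ended with a period, which would
# account for the sixth token; the article quotes only the words.
EXAMPLE_TEXT = "my favorite color is red."


def to_byte_values(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level model would receive."""
    return list(text.encode("utf-8"))


byte_values = to_byte_values(EXAMPLE_TEXT)

print("tokens:", len(ARTICLE_TOKEN_IDS))
print("bytes: ", len(byte_values))
print("first bytes:", byte_values[:2])
```

Under that assumption, the byte sequence is several times longer than the token sequence, which is the length problem Karpathy's post refers to.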

Unfortunately, even with tokenization, the amount of data current state-of-the-art systems can process still has a hard limit. For GPT-3.5, the limit is slightly over 4,000 tokens, or about 3,000 words, whereas GPT-4 maxes out at around 32,000 tokens, or about 24,000 words.
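The figures above imply a rule of thumb of roughly 0.75 words per token. The sketch below applies that ratio to the quoted limits; the ratio is only an approximation for English text, and `approx_words` and `fits_in_context` are hypothetical helper names used for illustration.

```python
# Ratio implied by the article's figures (3,000 words per 4,000 tokens).
WORDS_PER_TOKEN = 0.75


def approx_words(token_limit: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(token_limit * WORDS_PER_TOKEN)


def fits_in_context(text: str, token_limit: int) -> bool:
    """Rough check: compare the whitespace-separated word count to the estimate."""
    return len(text.split()) <= approx_words(token_limit)


print(approx_words(4_000))   # GPT-3.5 figure quoted in the article
print(approx_words(32_000))  # GPT-4 figure quoted in the article
```

A real check would count tokens with the model's own tokenizer; the word-count shortcut is only meant to show where the article's word estimates come from.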

Meta’s new Megabyte system ditches tokenization in favor of a novel multi-layer prediction architecture capable of end-to-end modeling over 1 million bytes of data.
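The article does not detail the architecture. As described in the Megabyte pre-print, the byte sequence is split into fixed-size patches, with a larger global model working across patches and a smaller local model predicting bytes within each patch. The sketch below illustrates only the patching step and a rough count of attention interactions; the patch size, padding value and function names are assumptions chosen for illustration, and the count ignores causal masking and constant factors.

```python
def patchify(data: bytes, patch_size: int, pad: int = 0) -> list[list[int]]:
    """Split a byte sequence into fixed-size patches, padding the last one."""
    values = list(data)
    remainder = len(values) % patch_size
    if remainder:
        values += [pad] * (patch_size - remainder)
    return [values[i:i + patch_size] for i in range(0, len(values), patch_size)]


def attention_pairs_flat(n: int) -> int:
    """Pairwise interactions if every byte attends to every other byte."""
    return n * n


def attention_pairs_patched(n: int, patch_size: int) -> int:
    """Global attention across patches plus local attention inside each patch."""
    num_patches = -(-n // patch_size)  # ceiling division
    return num_patches ** 2 + num_patches * patch_size ** 2


print(patchify(b"abcdefghij", 4))
print(attention_pairs_flat(1_000_000))
print(attention_pairs_patched(1_000_000, 1_000))
```

With these illustrative numbers, the patched layout needs roughly a thousandth of the pairwise interactions of flat byte-level attention, which is why very long byte sequences become tractable.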

Most English-language text uses standard 8-bit encoding, in which each character takes up one byte. An AI system capable of processing 1 million bytes of data without tokenization could therefore work with text documents of about 1 million characters. At roughly five to six bytes per English word, spaces included, that is on the order of 170,000 to 200,000 words, or about seven to eight times GPT-4’s limit of around 24,000 words.

For comparison, GPT-4 can currently handle about 10 feature-length news articles in a single prompt, whereas Megabyte would be able to parse roughly two average-length novels.

Meta’s Megabyte model also performed well on ImageNet tests and benchmarks related to processing audio files, either equaling or surpassing existing byte-based transformer models such as DeepMind’s Perceiver AR on both:

“Megabyte matches the state-of-the-art performance of PerceiverAR whilst using only half the compute.”

The implications of this research could be far-reaching. Tokenization is considered a roadblock in the field due to its hard data limits and the amount of energy and time required to train systems.

Without tokenization, it should be possible to train AI models with stronger foundational support for non-English languages, especially those that can’t be easily encoded in standard 8-bit characters.
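As a small illustration of why non-English text does not fit the one-byte-per-character assumption, the following Python sketch prints how many UTF-8 bytes a few sample characters occupy. The sample characters and the helper name `utf8_length` are chosen here for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters written as escapes to avoid encoding ambiguity.
samples = {
    "Latin letter a": "a",
    "Latin letter e with acute": "\u00e9",
    "CJK character for 'day/sun'": "\u65e5",
    "Emoji (grinning face)": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")
```

A byte-level model reads all of these as plain byte sequences, so no language-specific vocabulary is needed, though scripts that use more bytes per character consume the byte budget faster.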

This could lead to the further democratization of these technologies and enable everything from cryptocurrency trading bots to decentralized autonomous organization technologies to be built in native language codes around the world.

It could also increase the capacity of models like ChatGPT to work with image, video and audio files, allowing them to generate multimedia clips with roughly the same time and energy consumption as text.

Sourced from cointelegraph.com.

Written by Tristan Greene on 2023-05-27 21:20:34.
