Small Data, Big Compute

https://www.moderndescartes.com/essays/small_data_big_compute/

Originally posted 2024-03-31

Tagged: software engineering, machine learning, strategy, popular ⭐️

Obligatory disclaimer: all opinions are mine and not of my employer


LLMs are really expensive to run, computationally speaking. I think the order-of-magnitude difference compared to traditional data processing may surprise you.

While working at Lilac, I coined the phrase “small data, big compute” to describe this pattern, and used it to drive engineering decisions.

Arithmetic intensity

Arithmetic intensity is a concept popularized by NVidia that measures a very simple ratio: how many arithmetic operations are executed per byte transferred?

Consider a basic business analyst query: SELECT SUM(sales_amount) FROM table WHERE time < end_range AND time >= start_range. This query executes 1 addition for each 4-byte floating point number it processes, for an arithmetic intensity of 0.25. However, the bytes corresponding to sales_amount are usually interleaved with the bytes for time and row_id and everything else in the table, so only 1-10% of the bits read from disk are actually relevant to the calculation, for a net arithmetic intensity of 0.01.

Is 0.01 good or bad? Well, computers can read data from disk at roughly 1 GiB per second, or 250M floats per second; they can execute roughly 8-16 floating point operations per cycle, which at a 3 GHz clock speed works out to 25-50B float ops per second. Computers therefore have roughly a 100:1 available ratio of compute to disk bandwidth. Any code with an arithmetic intensity of less than ~100 is underutilizing the CPU.
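For concreteness, here is that back-of-the-envelope arithmetic as a few lines of Python. The bandwidth, FLOPs-per-cycle, and column-selectivity figures are the rough assumptions from the text, not measurements:

```python
# Rough figures from the discussion above -- back-of-the-envelope, not measurements.
disk_floats_per_s = (1 * 2**30) / 4   # ~1 GiB/s of 4-byte floats ~= 250M floats/s
cpu_flops_per_s = 8 * 3e9             # 8 FLOPs/cycle at 3 GHz ~= 25B float ops/s

machine_ratio = cpu_flops_per_s / disk_floats_per_s
print(f"available compute:disk ratio ~= {machine_ratio:.0f}:1")   # roughly 100:1

# The SUM query does 1 useful add per sales_amount float, but only ~1-10% of what it
# reads from disk is the sales_amount column; take the low end for a wide table.
net_query_intensity = 1 * 0.01
print(f"net query intensity ~= {net_query_intensity} useful ops per float read")
print(f"CPU underutilization ~= {machine_ratio / net_query_intensity:,.0f}x")  # ~10,000x
```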

In other words, your typical business analyst query is horrendously underutilizing the computer, by a factor of about 10,000x. This mismatch is why there exists a $100B market for database companies and technologies that can optimize these business queries (Spark, Parquet, Hadoop, MapReduce, Flume, etc.). They do so by using columnar databases and on-the-fly compression techniques like run-length encoding, bit-packing, and delta compression, which trade increased compute for more effective use of bandwidth. The result is blazing fast analytics queries that actually fully utilize the 100:1 available ratio of compute to disk.

How many FLOPs do we spend per byte of user data in an LLM? Well… consider the popular 7B model size. As a rough approximation, let’s say each parameter-byte interaction results in 1 FLOP, for an arithmetic intensity of \(10^{10}\) operations per byte processed. Larger LLMs can go to \(10^{13}\). You could quibble about bytes vs. tokens or multiply vs. add and the cost of exponentiation. But does it really matter whether it’s 8 or 9 orders of magnitude more expensive per byte than the business analyst query? Convnets for image processing have an arithmetic intensity of \(10^4\) - \(10^5\). That’s large but not unreasonable, which is why they’ve found many applications in factory QC, agriculture, satellite imagery processing, etc.
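A slightly different route to the same order of magnitude, using the common rule of thumb of ~2 FLOPs per parameter per token for a decoder forward pass and ~4 bytes of English text per token (both figures are my assumptions here, not from the text):

```python
params = 7e9                      # 7B model
flops_per_token = 2 * params      # ~2 FLOPs per parameter per token (rule of thumb)
bytes_per_token = 4               # rough average for English text
flops_per_byte = flops_per_token / bytes_per_token
print(f"~{flops_per_byte:.0e} FLOPs per byte of input text")  # ~4e9, same order as 1e10
```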

Needless to say, this insane arithmetic intensity breaks just about every assumption and expectation that’s been baked into the way we think about software for the past twenty years.

Technical implications

No need for distributed systems

Unless you work at one of the handful of companies that train LLMs from scratch, you will not have the budget to operate LLMs on “big data”. A single 1TB hard drive can store enough text data to burn 10 million dollars in GPT-4 API calls!

As a result, most business use cases for LLMs will inevitably operate on small data - say, <1 million rows.

The software industry has spent well over a decade learning how to build systems that scale across trillions of rows and thousands of machines, with the tradeoff that you would wait at least 30s per invocation. We got used to this inconvenience because it let us turn a 10 day single-threaded job into a 20 minute distributed job.

Now, faced with the daunting prospect of a mere 1 million rows, all of that is unnecessary complexity. Users deserve sub-second overheads when doing non-LLM computations on such small data. Lilac utilizes DuckDB to blast all cores on a single machine to compute basic summary statistics for every column in the user’s dataset, in less than a second - a luxury that we can afford because of small data!
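For a sense of what that looks like, here is a generic sketch (not Lilac’s actual code, and the file name is hypothetical): DuckDB’s SUMMARIZE statement computes per-column statistics across all cores, and on a dataset of this size it typically returns in well under a second.

```python
import duckdb

con = duckdb.connect()
# SUMMARIZE produces one row per column: min, max, approximate distinct count,
# quartiles, null percentage, etc. DuckDB parallelizes this across all cores.
summary = con.execute("SUMMARIZE SELECT * FROM 'dataset.parquet'").df()
print(summary)
```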

Massive budget for bloat

Ordinarily, inefficiencies in per-item handling can add up to a significant cost. This includes things like network bandwidth/latency, preprocessing of data in a slow language like Python, HTTP request overhead, unnecessary dependencies, and so on.

LLMs are so expensive that everything else is peanuts. There is a lot more budget for slop and I fully expect businesses to use this budget. I am sorry to the people who are frustrated with the increasing bloat of the modern software stack - LLMs will bring on yet another expansionary era of bloat.

At Lilac, we ended up building a per-item progress saver into our dataset.map call, because it was honestly a small cost, relative to the fees that our users were incurring while making API calls to OpenAI. In comparison, HuggingFace’s dataset.map doesn’t implement checkpointing, because it would be an enormous waste of time and compute and disk space to checkpoint the result of a trivial arithmetic operation.
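The idea is straightforward. Here is a minimal sketch of a checkpointed map (not Lilac’s actual implementation): persist each result as it completes, so an interrupted run of expensive API calls can resume instead of re-spending the money.

```python
import json
import os

def checkpointed_map(fn, items, checkpoint_path):
    """Apply `fn` to each item, appending every result to a JSONL checkpoint so that an
    interrupted run (say, of paid LLM API calls) can resume where it left off.
    Minimal sketch only -- not Lilac's actual implementation."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            for line in f:
                record = json.loads(line)
                results[record["i"]] = record["out"]

    with open(checkpoint_path, "a") as f:
        for i, item in enumerate(items):
            if i in results:
                continue  # already computed on a previous run
            out = fn(item)  # the expensive part, e.g. an LLM API call
            f.write(json.dumps({"i": i, "out": out}) + "\n")
            f.flush()  # per-item durability is cheap relative to the API call
            results[i] = out

    return [results[i] for i in range(len(items))]
```

The per-item write and flush would be an absurd overhead for a trivial arithmetic map, but it is noise next to the cost of each LLM call.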

Latency-batching tradeoffs

GPU cores have a compute-to-memory-bandwidth ratio of around 100 - they are not fundamentally different from CPUs in this regard. Ironically, LLMs end up bandwidth-limited despite the insane arithmetic intensity quoted above. If you also count the parameters of the model in the “bytes transferred” denominator, then LLM arithmetic intensity is roughly \(\frac{nm}{n + m}\), with n = input bytes and m = model bytes. Since \(m \gg n\), arithmetic intensity is approximately \(n\). Increasing batch size is thus a free win, up to the point where the GPU is compute-bound rather than bandwidth-bound.
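A rough sketch of that crossover, using the formula above with illustrative numbers of my own choosing (a 7B model in fp16, ~4 bytes per input token, and a GPU compute:bandwidth ratio of ~100):

```python
GPU_FLOPS_PER_BYTE = 100      # rough compute:bandwidth ratio of a modern GPU
m = 7e9 * 2                   # model bytes: 7B parameters in fp16

for batch_tokens in (1, 8, 64, 512):
    n = batch_tokens * 4                 # input bytes in flight (~4 bytes/token)
    intensity = n * m / (n + m)          # ~= n, since m >> n
    regime = "compute-bound" if intensity >= GPU_FLOPS_PER_BYTE else "bandwidth-bound"
    print(f"{batch_tokens:>4} tokens in flight: intensity ~= {intensity:.0f} FLOPs/byte ({regime})")
```

The exact crossover depends on the model, precision, and hardware; the point is just that intensity grows linearly with the amount of input in flight.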

For real-time use cases like chatbots, scale is king! When you have thousands of queries per second, it becomes easy to wait 50 milliseconds for a batch of user queries to accumulate, and then execute them in a single batch. If you only have one query per second, you are in a situation where you will either get poor GPU utilization (expensive hardware goes to waste), or users will have to wait multiple seconds for enough accumulated queries to make a batch.
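Serving systems implement that accumulation step with some form of micro-batching. Here is a toy sketch (generic, not any particular framework’s API): requests queue up for at most 50 ms, or until the batch is full, then run through the model in one call.

```python
import asyncio
import time

class MicroBatcher:
    """Toy request batcher: waits up to max_wait_s for a batch to fill before running."""

    def __init__(self, run_batch, max_batch=32, max_wait_s=0.05):
        self.run_batch = run_batch        # e.g. one batched forward pass on the GPU
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    def start(self):
        # Must be called from inside a running event loop.
        self._task = asyncio.create_task(self._loop())

    async def submit(self, request):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((request, fut))
        return await fut

    async def _loop(self):
        while True:
            batch = [await self.queue.get()]          # block until the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_batch([req for req, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def demo():
    fake_model = lambda reqs: [f"response to {r}" for r in reqs]  # stand-in for the GPU
    batcher = MicroBatcher(fake_model, max_batch=32, max_wait_s=0.05)
    batcher.start()
    print(await asyncio.gather(*(batcher.submit(f"query {i}") for i in range(5))))

asyncio.run(demo())
```

At thousands of queries per second, the 50 ms window fills a batch essentially for free; at one query per second, you eat either the idle GPU or the added latency - exactly the tradeoff described above.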

For offline use cases like document corpus embedding/transformation, we can automatically get full utilization through internal batching of the corpus. Because GPUs are the expensive part, I expect organizations to implement a queueing system to maximize usage of GPUs around the clock, possibly even intermingling offline jobs with real-time computation.

Minimal viable fine-tune

As a corollary of “compute cost dominates all”, any and all ways to optimize compute cost will be utilized. We will almost certainly see a relentless drive towards specialization of cheaper fine-tuned models for every conceivable use case. Stuff like speculative decoding shows just how expensive the largest LLMs are - you can productively run a smaller LLM to try and predict the larger LLM’s output, in real time!

Between engineering optimizations, fine-tuning/research breakthroughs, and the increased availability of massively parallel hardware optimized for LLMs, the cost of any particular performance point will decrease significantly. Some people claim 4x every year, which sounds aggressive but not that unreasonable: 1.5x each from hardware, research, and engineering optimizations (1.5³ ≈ 3.4x) gets you close to ~4x.

I expect there to be a good business in drastically reducing compute costs by making it very easy to fine-tune a minimal viable model for a specific purpose.

Business implications

Data egress is not a moat

Cloud providers invest a lot of money into onboarding customers, with the knowledge that once they’re inside, it becomes very expensive to unwind all of the business integrations they’ve built. Furthermore, it becomes very expensive to even try to diversify into multiple clouds, because data egress outside of the cloud is stupidly expensive. This is all part of an intentional strategy to make switching harder.

Yet, the insane cost of LLMs means that data egress costs are a relatively small deal. 1GB of egress costs ~$0.10, while embedding 1GB worth of text would cost ~$50. As a result, I expect that…

A new GPU cloud will emerge

Because of the ease with which small data can flow between clouds, I expect a new cloud competitor, focused on cheap GPU compute. Scale will be king here, because increased scale results in negotiating power for GPU purchase contracts, investments into GPU reliability, investments into engineering tricks to maximize GPU utilization, improved latency for realtime applications, and investments into data/security/compliance certifications. Modal, Lambda, and NVidia seem like potential cloud winners here, but the truth is that we’re all winners, because relentless competition will drive down GPU costs for everyone.

Attack > defense

A certain class of user-generated content will become a Turing Arena of sorts, where LLMs will generate fake text (think Amazon product reviews or Google search result spam or Reddit commenter product/service endorsements), and LLMs will try to detect LLM-generated text. I think it’s a reasonable guess that LLMs will only be able to detect other LLMs of lesser quality.

Unfortunately for the internet, I think attack will win out over defense. The reason is safety in numbers.

A small number of attackers will have the resources to use the most expensive LLMs to generate the most realistic looking fake reviews, specifically in categories where the profit margins are highest (think “best hotel in Manhattan” or “best Machu Picchu tour”). However, a much larger number of attackers will have moderate resources to use medium-sized LLMs to generate a much larger volume of semi-realistic fake reviews. The defense, on the other hand, has to scale up LLMs to run on all user-generated content, and realistically they will only be able to afford running medium or small LLMs to do so. Dan Luu’s logorrhea on the diseconomies of scale is exactly the right way to think here.

Conclusion

“Small data, big compute” allowed us to optimize for a certain class of dataset and take certain shortcuts. The Lilac team will be joining Databricks and I look forward to continuing to build systems tailored to the unusual needs of LLMs!
