When I teach, I use "big data" for data that won't fit in a single machine. "Small data" fits in a single machine's memory; "medium data" fits on its disk.
Having said that, DuckDB is awesome. I recently ported a 20-year-old Python app to modern Python. I made the backend swappable: Polars or DuckDB. Got a 40-80x speed improvement. Took 2 days.
The funny thing is that these days you can fit 64 TB of DDR5 in a single physical system (IBM Power server), so almost all non-data-lake-class data is "small data".
> There aren't many datasets exceeding that outside fundamental physics.
Just about every physical-world telemetry or sensing data source of any note will generate a petabyte-scale analytical data model in hours to days. On the high end, there are single categories of data source that aggregate to more like an exabyte per day of high-value data.
It is a completely different standard of scale than web data. In many industrial domains, the average small-to-medium sized company I come across retains tens of petabytes of data, and it has been this way for many years. The prohibitive cost is the only thing keeping them from scaling even more.
The major issue is that the large-scale analytics infrastructure developed for web data is hopelessly inadequate.
You could generate PB of data from a random number generator.
My question would be, why does a company need PBs of sensor data? What justifies retaining so much? Surely you aren’t using it beyond the immediate present.
There's nothing wrong with that. Small data is relative, and my clients often find it useful to rent or get access to beefy machines to process it with "small" techniques rather than use clusters...
I'm curious - what were you doing that polars was leaving a 40-80x speedup on the table? I've been happy with its speed when held correctly, but it's certainly easy to hold it incorrectly and kill your perf if you're not careful.
KDB v1 is from sometime in the late 1990s (I met v2 in 2002; v1 was internal-use only at some investment bank).
But that follows A and A+, which were extremely column-oriented and date to the early 1990s or even late 1980s; and various APL implementations going back to the 1960s.
Columnar DBs were very much a thing among APL users (finance and operations research) but weren't really known outside those fields - and even in those fields, there was a period of amnesia in the late '90s/early 2000s.
Might be tangential, but in my recent experience Polars kept crashing the Python server with OOM errors whenever I tried to stream data from and into large parquet files with some basic grouping and aggregation.
Claude suggested just using DuckDB instead, and indeed, it made short work of it.
A bit of a moving target there, especially the definition of medium data on disk, considering the rise of high-speed NVMe vs spinning metal. Makes me wonder if the 00s "Big Data" era and the resulting infra are largely just outdated now...