When I teach, I use "big data" for data that won't fit in a single machine. "Small data" fits in a single machine's memory; "medium data" fits on its disk.
Having said that, DuckDB is awesome. I recently ported a 20-year-old Python app to modern Python. I made the backend swappable: Polars or DuckDB. Got a 40-80x speed improvement. Took 2 days.
The funny thing is that these days you can fit 64 TB of DDR5 in a single physical system (IBM Power server), so almost all non-data-lake-class data is "small data".
> There aren't many datasets exceeding that outside fundamental physics.
Just about every physical-world telemetry or sensing data source of any note will generate a petabyte-scale analytical data model in hours to days. On the high end, there are single categories of data source that aggregate to more like an exabyte per day of high-value data.
It is a completely different standard of scale than web data. In many industrial domains, the average small-to-medium sized company I come across retains tens of petabytes of data, and it has been this way for many years. The prohibitive cost is the only thing keeping them from scaling even more.
The major issue is that the large-scale analytics infrastructure developed for web data is hopelessly inadequate.
You could generate PB of data from a random number generator.
My question would be, why does a company need PBs of sensor data? What justifies retaining so much? Surely you aren’t using it beyond the immediate present.
There's nothing wrong with that. Small data is relative, and my clients often find it useful to rent or get access to beefy machines to process it with "small" techniques rather than use clusters...
I'm curious - what were you doing that polars was leaving a 40-80x speedup on the table? I've been happy with its speed when held correctly, but it's certainly easy to hold it incorrectly and kill your perf if you're not careful.
KDB v1 is from sometime in the late 1990s (I met v2 in 2002; v1 was internal-use only at some investment bank).
But that follows A and A+, which were extremely column-oriented and date to the early 1990s or even late 1980s; and various APL implementations going back to the 1960s.
Columnar DBs were very much a thing among APL users (finance and operations research) but weren't really known outside those fields - and even in those fields, there was a period of amnesia in the late '90s/early 2000s.
Might be tangential, but in my recent experience Polars kept crashing the Python server with OOM errors whenever I tried to stream data from and into large parquet files with some basic grouping and aggregation.
Claude suggested just using DuckDB instead, and indeed, it made short work of it.
A bit of a moving target there, especially the definition of medium data on disk, considering the rise of high-speed NVMe vs spinning metal. Makes me wonder if the 00s "Big Data" era and the resulting infra are largely just outdated now...