Comments (190)

  • MarginalGainz
    The saddest part about this article being from 2014 is that the situation has arguably gotten worse. We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
    I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
  • adamdrake
    Author here! It's great to see this post I wrote years ago still being useful for people.
    I agree with many here that the situation is arguably worse in many ways. However, along similar lines, I've been pleased to see a move away from cargo-culting microservices (another topic I addressed in a separate post on that site).
    To all those helping companies and teams improve performance, keep it up! There is hope!
  • benrutter
    This times a zillion! I think there's been a huge industry push to convince managers and more junior engineers that Spark and distributed tools are the correct way to do data engineering.
    It's a similar pattern to how web dev influencers have convinced everyone to build huge hydrated-SPA-framework craziness where a static site would do.
    My advice to get out of this mess:
    - Managers, don't ask for specific solutions (Spark, React). Ask for clever engineers to solve problems, and optimise/track what you care about (cost, performance, etc.). You hired them to know best, and they probably do.
    - Technical leads, if your manager is asking "what about hyperscale?", you don't have to say "our existing solution will scale forever". It's fine to say, "our pipelines handle datasets up to 20GB, we don't expect to see anything larger soon, and if we do we'll do x/y/z to meet that scale". Your manager probably just wants to know scaling isn't going to crash everything, not that you've optimised the hell out of your Excel spreadsheet processing pipeline.
  • rented_mule
    A little bit of history related to the article for any who might be interested...
    mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
    Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
    I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
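    For context, a rough sketch of what switching runners looks like (the runner names are real mrjob options; the job script and input file names are placeholders):

    ```bash
    python mr_count_results.py -r inline games.pgn   # single process, handy for debugging
    python mr_count_results.py -r local  games.pgn   # local mode: subprocesses, no Hadoop
    python mr_count_results.py -r emr    games.pgn   # spins up a transient AWS EMR cluster
    ```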
  • mbb70
    The bigness of your data has always depended on what you are doing with it.
    Consider the following table of medical surgeries: date, physician_name, surgery_name, success.
    "What are the top 10 most common surgeries?" - easy in bash.
    "Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash.
    "Which surgeries are most affected by physician experience?" - very hard in bash; it requires calculating, for every surgery, how many times that physician had performed that surgery by that date, then comparing low- and high-experience outcomes.
    A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.
    It's easier for a business to say "we use Spark for data processing" than "we build bespoke processing engines on a case-by-case basis".
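    The "easy in bash" end of that continuum really is a line or two, assuming the hypothetical date,physician_name,surgery_name,success layout above in a file called surgeries.csv:

    ```bash
    # Top 10 most common surgeries
    cut -d',' -f3 surgeries.csv | sort | uniq -c | sort -rn | head -10

    # Success rate per physician over the last year (ISO dates assumed, success coded as 1/0)
    awk -F',' '$1 >= "2013-01-01" { total[$2]++; ok[$2] += ($4 == 1) }
               END { for (p in total) printf "%s %.1f%%\n", p, 100 * ok[p] / total[p] }' surgeries.csv \
      | sort -k2 -rn
    ```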
  • torginus
    When I worked as a data engineer, I rewrote some Bash and Python scripts in C# that had been processing gigabytes of JSON at tens of MB/s - a huge bottleneck.
    By applying some trivial optimizations, like streaming the parsing, I managed to get it running at almost disk speed (1 GB/s on an SSD back then).
    Just how much data do you need before these sorts of clustered approaches really start to make sense?
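    Not the commenter's C#, but the same streaming idea translates directly to the shell: with newline-delimited JSON, each tool handles one record at a time, so memory stays flat and throughput is bounded by the disk (logs.jsonl and the field names here are invented):

    ```bash
    jq -r 'select(.level == "error") | .message' logs.jsonl | sort | uniq -c | sort -rn | head
    ```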
  • paranoidrobot
    A selection of times it's been previously posted:
    (2018, 222 comments) https://news.ycombinator.com/item?id=17135841
    (2022, 166 comments) https://news.ycombinator.com/item?id=30595026
    (2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
  • hmokiguess
    Tangential, but this reminds me of the older K website, back when it was shakti.com, which had an intro like this in its about section:
    1K rows: use excel
    1M rows: use pandas/polars
    1B rows: use shakti
    1T rows: only shakti
    Source: https://web.archive.org/web/20230331180931/https://shakti.co...
  • forinti
    I think many devs learn the trade with Windows and don't get exposure to these tools.
    Plus, they require a bit of reading because they operate at a higher level of abstraction than loops and ifs. You get implicit loops, your fields get cut up automatically, and you can apply regexes simultaneously to all fields. So it's not obvious to the untrained eye.
    But you get a lot of power and flexibility on the CLI, which enables you to rapidly put together an ad hoc solution that can get the job done, or at least serve as a baseline before you reach for the big guns.
  • KolmogorovComp
    > The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.
    It would be interesting to redo the benchmark with a (much) larger database.
    Nowadays the biggest open data for chess must come from Lichess (https://database.lichess.org), with ~7B games and 2.34 TB compressed, ~14 TB uncompressed.
    Would Hadoop win here?
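    For a first pass you wouldn't even need a cluster to touch that dump: the monthly files are zstd-compressed PGN, so you can stream them and never materialise the uncompressed data on disk (the file name below is illustrative):

    ```bash
    zstdcat lichess_db_standard_rated_2024-01.pgn.zst \
      | grep -F '[Result ' \
      | awk '{ counts[$0]++ } END { for (r in counts) print counts[r], r }'
    ```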
  • phyzix5761
    It’s easy to overlook how often straightforward approaches are the best fit when the data and problem are well understood. Large, expensive tools can become problems in their own right, creating complexity that then requires even more tooling to manage. (Maybe that's the intent?) The issue is that teams and companies often adopt optimization frameworks earlier than necessary. Starting with simpler tools can get you most of the way there, and in many cases they turn out to be all that’s needed.
  • jeswin
    The same thing is true of SQLite vs Postgres. Most startups need SQLite, not Postgres. Many queries run an order of magnitude faster. Not only is it better for your users, it's life-changing to see test suites (which would take minutes to run) complete in mere seconds.
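    Part of why that is: SQLite is an in-process library writing to a single file, so there is no server to start, connect to, or reset between test runs (app.db below is just a placeholder):

    ```bash
    sqlite3 app.db "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT);
                    INSERT INTO users (name) VALUES ('alice');
                    SELECT count(*) FROM users;"
    ```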
  • fmajid
    I've contributed to PrestoDB, but the availability of DuckDB and fast multi-core machines with even faster SSDs makes the need for distribution ever more niche; often it's just cargo-culting Google or Meta.
  • meken
    I’m curious about the memory usage of the cat | grep part of the pipeline. I think the author is processing many small files?
    In which case it makes the analysis a bit less practical, since the main use case I have for fancy data processing tools is when I can’t load a whole big file into memory.
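    Whether the input is many small files or one file far larger than RAM shouldn't matter much here: every stage of a pipe works line by line and discards what it has already emitted, so memory stays at a few pipe buffers. A sketch in the article's style (paths illustrative):

    ```bash
    cat games/*.pgn | grep -c '^\[Result '
    ```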
  • fifilura
    No joins in that article?
    The comments here smell of "real engineers use the command line". But I am not sure they have ever actually analysed data beyond using the command line as a log parser.
    Yes, Hadoop is 2014. These days you obviously don't set up a Hadoop cluster; you use a managed cloud service (BigQuery or AWS Athena, for example). Or map your data into DuckDB, or use Polars if it is small.
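    Joins are indeed where ad hoc awk gets painful, but they still don't need a cluster. A hedged sketch with DuckDB's CLI, which can query CSV files in place (file and column names are invented):

    ```bash
    echo "
    SELECT u.country, count(*) AS games
    FROM 'games.csv' g
    JOIN 'users.csv' u ON g.white_player = u.username
    GROUP BY u.country
    ORDER BY games DESC
    LIMIT 10;
    " | duckdb
    ```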
  • srcreigh
    MapReduce is from a world of slow HDDs, expensive RAM, expensive enterprise-class servers, and fast networks.
    In that world, to get the best performance you’d have to shard your data across a cluster and use MapReduce.
    Even in the author's 2014 world of SSDs and multi-core consumer PCs, their aggregate pipeline would be around 2x faster if the work were split across two equivalent machines.
    The limit of how much faster distributed computing can be comes down to latency more than throughput. I’d not be surprised if this aggregate query could run in 10ms on pre-sharded data in a distributed cluster.
  • jgord
    Highly recommend xsv by BurntSushi [a CSV parser/wrangler written in Rust].
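    A few hedged xsv one-liners to show the flavour (data.csv and the column names are placeholders):

    ```bash
    xsv headers data.csv                        # list column names
    xsv stats data.csv | xsv table              # per-column type, min/max, mean, etc.
    xsv frequency -s surgery_name data.csv      # value counts for one column
    xsv search -s physician_name 'Smith' data.csv | xsv select date,success | xsv table
    ```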
  • ejoebstl
    Great article. Hadoop and other similar tools are for datasets so huge they don't fit on one machine.
  • EdwardCoffin
    This makes me think of Bane's rule, described in this comment [1]:
    Bane's rule: you don't understand a distributed computing problem until you can get it to fit on a single machine first.
    [1] https://news.ycombinator.com/item?id=8902739
  • rcarmo
    This has been a recurring theme for ages, with a few companies taking it to extremes; there are people transpiling COBOL to bash, too…
  • nasretdinov
    And now, with things like DuckDB and clickhouse-local, you won't have to worry about data processing performance ever again. Just kidding, but especially with ClickHouse, handling large data volumes is so much better than it used to be, and even a single beefy server is often enough to satisfy all the data analytics needs of a moderate-to-large company.
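    For the curious, clickhouse-local gives you the full SQL engine against a local file with no server running (the file and column names here are made up):

    ```bash
    clickhouse-local --query "
      SELECT Result, count() AS games
      FROM file('games.csv', CSVWithNames)
      GROUP BY Result
      ORDER BY games DESC
    "
    ```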
  • jonathanhefner
    And since AI agents are extremely good at using them, command-line tools are also probably 235x more effective for your data science needs.
  • killingtime74
    Hadoop, blast from the past
  • olq_plo
    And now you can do this with polars in parallel on all your cores and the GPU, using almost the same syntax as in pyspark.
  • jeffbee
    Something to note here is that the result of xargs -P is unlikely to be satisfactory, since all of the subprocesses are simply connected to the terminal and stomp over each other's outputs. A better choice would be something like rush or, for the Perl fans, parallel.
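    For illustration, GNU parallel buffers each job's output and prints it as one unit (grouping is the default), whereas the xargs version lets every subprocess write straight to the shared terminal (paths illustrative):

    ```bash
    find . -name '*.pgn' -print0 | parallel -0 "grep -Hc '^\[Result ' {}"
    find . -name '*.pgn' -print0 | xargs -0 -n1 -P8 grep -Hc '^\[Result '
    ```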
  • cryptoboy2283