Comments (82)
- brentroose: A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more. That's why I built a performance challenge for the PHP community. The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (among the prizes is the very sought-after PhpStorm Elephpant, of which we only have a handful left). I hope people will have fun with it :)
- Xeoncross: This is why I jumped from PHP to Go, then why I jumped from Go to Rust. Go is the most batteries-included language I've ever used. Instant compile times mean I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python. Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but you can't beat its performance and safety. It removes a whole lot of tests, as those situations just aren't possible. If Rust scares you use Go. If Go scares you use Rust.
- pxtail: Side note: I wasn't aware that there is an active collectors' scene for Elephpants, awesome! https://elephpant.me/
- semiquaver: Are they just confused about what characters require escaping in JSON strings, or is PHP weirder than I remember?
  { "\/blog\/11-million-rows-in-seconds": { "2025-01-24": 1, "2026-01-24": 2 }, "\/blog\/php-enums": { "2024-01-24": 1 } }
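On the escaping question: JSON allows `\/` as an optional escape for `/`, so both spellings decode to the same string, and PHP's `json_encode` emits the escaped form unless `JSON_UNESCAPED_SLASHES` is passed. A minimal sketch of the equivalence, written in Python purely for illustration and using a key from the quoted output:

```python
import json

# Two spellings of the same JSON document: with and without the optional "\/" escape.
escaped = r'{"\/blog\/php-enums": {"2024-01-24": 1}}'
plain = '{"/blog/php-enums": {"2024-01-24": 1}}'

decoded_escaped = json.loads(escaped)
decoded_plain = json.loads(plain)

# Both decode to identical data, so the escape is legal but not required.
same = decoded_escaped == decoded_plain
print(same)

# Python's encoder leaves "/" unescaped when re-serializing.
reencoded = json.dumps(decoded_escaped, indent=4)
print(reencoded)
```

Any conforming JSON parser should treat the two files as identical, so the escapes affect readability rather than correctness.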
- chrismarlow9: I don't have time to put together a submission, but I'm willing to bet you can use this: https://github.com/kjdev/php-ext-jq
  And replicate this command (with -n so that `inputs` does not skip the first line):
  jq -n -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv
  And it will be faster than anything you can do in native PHP.
  Edit: I'm assuming none of the URLs contain a comma, but this is more about offloading the work to an extension, even if you custom-built it.
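For readers who don't speak jq, the grouping that command performs can be sketched in a few lines of Python. This is an illustration only, not a challenge entry: it assumes each row is `url,timestamp` with an ISO-style timestamp (as the `split("T")` in the command implies), and the host name and sample rows below are made up. Unlike the jq command, it keys the result by URL path, matching the expected output quoted elsewhere in the thread.

```python
import csv
import io
import json
from collections import defaultdict
from urllib.parse import urlparse


def count_visits(lines):
    """Count visits per URL path and calendar date from 'url,timestamp' rows."""
    counts = defaultdict(lambda: defaultdict(int))
    for url, timestamp in csv.reader(lines):
        path = urlparse(url).path          # keep only the path, e.g. /blog/php-enums
        date = timestamp.split("T", 1)[0]  # keep only the date part before "T"
        counts[path][date] += 1
    # Sort the dates within each path so the output is stable.
    return {path: dict(sorted(dates.items())) for path, dates in counts.items()}


# Hypothetical sample rows; the host name and timestamps are invented.
sample = io.StringIO(
    "https://example.com/blog/php-enums,2024-01-24T01:16:58+00:00\n"
    "https://example.com/blog/11-million-rows-in-seconds,2026-01-24T01:16:58+00:00\n"
    "https://example.com/blog/11-million-rows-in-seconds,2025-01-24T01:16:58+00:00\n"
    "https://example.com/blog/11-million-rows-in-seconds,2026-01-24T12:00:00+00:00\n"
)
result = count_visits(sample)
print(json.dumps(result, indent=4))
```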
- tveita:
  > Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others
  Except that the generator script generates dates relative to time()?
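The concern can be shown with a small sketch. This is not the challenge's generator; `generate_dates` and its parameters are hypothetical and only illustrate that a fixed seed reproduces the random offsets, while the resulting dates still depend on the reference time they are subtracted from.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_dates(seed, now, count=3):
    """Return `count` dates as random day offsets counted back from `now`."""
    rng = random.Random(seed)  # seeded, so the sequence of offsets is reproducible
    return [
        (now - timedelta(days=rng.randint(0, 365))).date().isoformat()
        for _ in range(count)
    ]


# Same seed, but two different "current" times (hypothetical values).
day_one = datetime(2026, 1, 5, tzinfo=timezone.utc)
day_two = datetime(2026, 1, 6, tzinfo=timezone.utc)

run_a = generate_dates(seed=42, now=day_one)
run_b = generate_dates(seed=42, now=day_two)
pinned = generate_dates(seed=42, now=day_one)

print(run_a == run_b)   # same offsets, but every date is shifted by one day
print(run_a == pinned)  # pinning the reference time restores reproducibility
```

Passing a fixed reference timestamp instead of reading the clock is one way to make such a dataset identical across runs.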
- matei88: It reminds me of a good read about optimizing PHP for the 1 billion rows challenge. TL;DR: at some point you hit a limit in PHP’s stream layer. https://dev.to/realflowcontrol/processing-one-billion-rows-i...
- csjh: Obligatory DuckDB solution:
  > duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
  Takes about 8 seconds on my M1 MacBook. The JSON is not in the right format, but fixing that wouldn't dominate the execution time.
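The remaining reshaping step is small. A rough sketch, assuming the query above writes a JSON array of objects with the keys `url`, `date`, and `c` (the column aliases in the query); the `nest` helper and the sample rows are hypothetical:

```python
import json


def nest(rows):
    """Turn flat {url, date, c} rows into {url: {date: count}}."""
    nested = {}
    for row in rows:
        nested.setdefault(row["url"], {})[row["date"]] = row["c"]
    return nested


# Hypothetical rows in the flat shape the query is assumed to produce.
rows = [
    {"url": "/blog/11-million-rows-in-seconds", "date": "2025-01-24", "c": 1},
    {"url": "/blog/11-million-rows-in-seconds", "date": "2026-01-24", "c": 2},
    {"url": "/blog/php-enums", "date": "2024-01-24", "c": 1},
]
nested = nest(rows)
print(json.dumps(nested, indent=4))
```

Because the query has already grouped by URL and date, this pass touches one row per distinct pair rather than all 100 million input rows.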
- Retr0id: How large is a sample 100M row file in bytes? (I tried to run the generator locally but my PHP is not bleeding-edge enough)
- poizan42:
  > The output should be encoded as a pretty JSON string.
  ...
  > Your parser should store the following output in $outputPath as a JSON file: { "\/blog\/11-million-rows-in-seconds": { "2025-01-24": 1, "2026-01-24": 2 }, "\/blog\/php-enums": { "2024-01-24": 1 } }
  They don't define what exactly "pretty" means, but superfluous escapes are not very pretty in my opinion.
- spiderfarmer: Awesome. I’ll be following this. I’ll probably learn a ton.
- wangzhongwang [dead]
- tomaytotomato: Tempted to submit a Java app wrapped in PHP exec() :D