<- Back
Comments (70)
- kstenerudThe problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
- wahernThis reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in ISO/IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple.EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround: -- The Yubikey 4 has an off-by-one bug which -- declares tag length of 255 (for the 0x53 outer -- tag of a certficate DO) when there are only 254 -- bytes remaining in the reply. The reply is -- chained across two packets, but the off-by-one is -- probably related to the over-long encoded length -- (0x82 0x00 0xff instead of 0x81 0xff). -- -- [snip packet captures] -- -- Yubico's ykpiv_fetch_object function in ykpiv.c -- (confirmed 1.4.3-1.5.0) contains a read (memmove) -- overflow when the declared inner BER-TLV length -- (of the 0x53 tag) is longer than what was -- received over the wire. That makes Yubico's -- library oblivious to the issue. Relatedly, the -- set_length function has an off-by-one bug (length -- < 0xff instead of length <= 0xff) which produces -- an over-long encoded length. That doesn't by -- itself explain why the Yubikey 4 transmits a -- truncated logical reply unless the same code is -- being used.
- i2talicsNon-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
- stebalienI've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.[1]: https://github.com/multiformats/multicodec
- juancnI like the denormalization of VLE ints (with or without zig-zag encoding of negatives), it helps support out of band information, such as nulls and other signals in serialization protocols with minimal overhead.For example you can use a denormalized zero to signal null.You can still define a canonical encoding where denormalizations have specific meaning or signal an error.
- boricjI'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
- omoikaneUTF-8 has the same issue ("overlong encoding") where multiple representations are possible the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
- billpgI forget where I encountered it, but I've seen similar encodings that eliminated the possibility of many possible encodings for the same number by making the length part of the value.Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.10000000 00000000 is the only way to represent 128.10000000 10000000 00000000 is the only way to represent 16512.Does this encoding have a name?
- arkenflameI researched many different varint encodings for a GraphQL-specific binary format (resulting in Argo: https://github.com/msolomon/argo ). I ended up choosing protobuf-style zig-zag varints, but I also found these interesting:vu128: https://john-millikin.com/vu128-efficient-variable-length-in...metric/imperial varint: https://dcreager.net/2021/03/a-better-varint/vectorizing VByte: https://arxiv.org/abs/1503.07387
- conaclosThis is pretty close to SQLite's varints [0][0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki
- willtemperleyMaybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.
- aDyslecticCrowClever, but one thought crossed my mind;An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
- nine_kIn short: instead of a truly indefinite-length solution with a signal bit on the current byte saying whether to check the next byte, this uses a counter. Values 0x0 to 0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5 bits as a counter for the number of subsequent bytes. This limits the maximum magnitude to slightly less than 2 ^ 264 (almost all 33-byte values), which seems to be okay for practical computations. The proposed standard limits the supported size to u64 though.The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.> ...adversarial input, which is rarely in the test suite.This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
- harrisiI'm surprised there's no mention in the post or here about SQLite's varint encoding. Not that it would necessarily satisfy the constraints, but it's one of the most used varint implementations.
- michaelmureOne nice upside of having a single way to encode a value is fuzzing: when you work on an encoder/decoder, you can use a fuzzer and do round-trip comparison until you find crashes or inputs/outputs that don't match (and therefore issues in the code). But with LEB128 for example, the fuzzer quickly learn about those alternatives encoding and there is not much you can do from there.
- apitmanThis reminds me of the varint encoding used by QUIC, but I've never implemented it. Anyone know the differences?
- yreadWouldn't something like this also work: https://en.wikipedia.org/wiki/Elias_omega_coding I've used to great effect for compression
- HansHamsterIt feels a bit unfair to say that it is faster by being able to tell the total length from the first byte and capping it at 64 bit, while some of the other formats can store arbitrarily large integers. I guess you could use another variable length encoding for the prefix at the cost of some performance and using even more space...
- MarkusQ> This causes problems for signed dataGiven that the context up to this point had been representation of integers, I initially trip on this. :)
- cantalopesI love the random hyperlink underlines on that page
- amlutoJust a quick reminder:> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.
- alex-reyss[dead]
- RedShift1This seems quite convoluted just to avoid the "0 can be represented in more than one way" problem.