Comments (45)

  • fzxu22
    Working on this: https://github.com/KevinXuxuxu/anon_proxy, a sort of anonymization proxy to use with LLM providers. It combines model-based (OpenAI Privacy Filter) and regex PII detection, replacing detected PII in outgoing API requests and restoring it in the responses. With a locally hosted detection model, no PII leaves your local environment. I find it very useful, especially when you're working on sensitive documents (legal, tax, immigration, etc.). Hope you find it helpful as well :)
  • stratos123
    There are some interesting technical details in this release:
    > Privacy Filter is a bidirectional token-classification model with span decoding. It begins from an autoregressive pretrained checkpoint and is then adapted into a token classifier over a fixed taxonomy of privacy labels. Instead of generating text token by token, it labels an input sequence in one pass and then decodes coherent spans with a constrained Viterbi procedure.
    > The released model has 1.5B total parameters with 50M active parameters.
    > [To build it] we converted a pretrained language model into a bidirectional token classifier by replacing the language modeling head with a token-classification head and post-training it with a supervised classification objective.
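The "constrained Viterbi" step in the quote above can be illustrated with a small sketch. It assumes a BIO tagging scheme with a single hypothetical `NAME` label and hand-picked per-token log-scores; the release's actual label taxonomy, transition constraints, and scores are not given in the quote, so treat every name below as illustrative.

```python
import math

# Hypothetical tag set: BIO scheme over a single "NAME" label.
TAGS = ["O", "B-NAME", "I-NAME"]

def allowed(prev, cur):
    """BIO constraint: an I- tag may only continue a span of the same label."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

def constrained_viterbi(scores):
    """scores[t][k] is the log-score of TAGS[k] at token t; returns the best valid path."""
    # A sequence may not start with an I- tag.
    best = [s if not TAGS[k].startswith("I-") else -math.inf
            for k, s in enumerate(scores[0])]
    back = []
    for row in scores[1:]:
        new_best, ptr = [], []
        for k, cur in enumerate(TAGS):
            # Best predecessor among those the constraint permits.
            score, j = max((best[j] if allowed(prev, cur) else -math.inf, j)
                           for j, prev in enumerate(TAGS))
            new_best.append(score + row[k])
            ptr.append(j)
        best = new_best
        back.append(ptr)
    k = max(range(len(TAGS)), key=lambda i: best[i])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return [TAGS[k] for k in reversed(path)]

def spans(tags):
    """Collapse a BIO tag path into (start, end_exclusive, label) spans."""
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and not tag.startswith("I-"):
            out.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return out

# Made-up scores for the tokens: Call / Alice / Smith / today
scores = [
    [-0.1, -3.0, -4.0],
    [-3.0, -1.0, -0.5],  # per-token argmax would pick I-NAME here (invalid after O)
    [-3.0, -2.5, -0.2],
    [-0.1, -3.0, -3.0],
]
greedy = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in scores]
decoded = constrained_viterbi(scores)
print(greedy)          # ['O', 'I-NAME', 'I-NAME', 'O']
print(decoded)         # ['O', 'B-NAME', 'I-NAME', 'O']
print(spans(decoded))  # [(1, 3, 'NAME')]
```

Picking the best tag per token independently can yield a span that starts with `I-NAME`; the constrained search trades a slightly lower local score for a globally valid sequence.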
  • nl
    I'm nowhere near as smart as OpenAI of course, but I did build https://tools.nicklothian.com/webner/index.html that uses a BERT-based named-entity-recognition model running in your browser to do a subset of PII redaction. It works pretty well for the use cases I was playing with. The OpenAI model is small enough that I might enhance my tool to use it.
  • aubinkure
    Exciting! I took a look through the code and found what appear to be the entity types for future releases - this release (V2 config) supports 8 entity types, but the V4 and V7 taxonomies have >20, mostly more personal ID types. Given this is a preview release, I imagine they'll release these. Details in my review article here: https://piieraser.ai/blog/openai-privacy-filter. Disclaimer: I also build PII detection systems.
  • maciejzj
    On a side note, when I click the link it redirects me to a machine-translated version of the OpenAI website with completely botched meaning - the word “redacted” is translated to a false friend, “redagować”, which means to edit/refine text, not to anonymize.
  • mplanchard
    It would be nice if their examples weren’t mostly things that are easy to catch with regex, but it’s cool to see it released as an open, local model.
  • mayneack
    Curious how this compares to Presidio, which mixes regex with a model: https://microsoft.github.io/presidio/
  • usdogu
    Someone has created the reverse of it: https://github.com/chiefautism/privacy-parser
  • mentalgear
    SuperagentLM made on-edge PII redaction models available a few years ago, in 20B, 3B, and 200M sizes. They still seem to be available via their legacy API - well worth checking out to compare against this one. https://docs.superagent.sh/legacy/llms/superagent-lm-redact-...
  • hiAndrewQuinn
    I'm surprised nobody else has commented on this. This is a very straightforward and useful thing for a small locally runnable model to do.
  • 7777777phil
    > The model is available today under the Apache 2.0 license on Hugging Face and Github.
    Bringing back the Open to OpenAI..
  • Havoc
    50M effective parameters is impressively light. Is there a similarly light model on the prompt-injection side? Most of the mainstream ones seem heavier.
  • freakynit
    Can someone explain how I can reconstruct the original entities if there is, for example, more than one person's name?
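One common way to handle the question above is to give each distinct entity its own numbered placeholder and keep the placeholder-to-original mapping locally. The sketch below assumes a detector that returns non-overlapping `(start, end, label)` character spans; the `<PERSON_1>` placeholder format and the function names are hypothetical, not something the released model prescribes, and a regex stands in for the detector.

```python
import re

def redact(text, spans):
    """Replace each detected span with a numbered placeholder.

    `spans` holds (start, end, label) character offsets, assumed non-overlapping.
    Returns the redacted text and a placeholder -> original mapping.
    """
    mapping = {}    # placeholder -> original text
    seen = {}       # (label, original) -> placeholder, so repeats share one id
    counters = {}   # label -> last number handed out
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        key = (label, original)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"<{label}_{counters[label]}>"
            mapping[seen[key]] = original
        out.append(text[cursor:start])
        out.append(seen[key])
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping

def restore(text, mapping):
    """Swap placeholders back; unknown placeholders are left untouched."""
    return re.sub(r"<[A-Z_]+_\d+>",
                  lambda m: mapping.get(m.group(0), m.group(0)), text)

text = "Alice emailed Bob, then Alice called Carol."
# Stand-in for a real detector: find three known names with a regex.
found = [(m.start(), m.end(), "PERSON") for m in re.finditer(r"Alice|Bob|Carol", text)]

redacted, mapping = redact(text, found)
print(redacted)  # <PERSON_1> emailed <PERSON_2>, then <PERSON_1> called <PERSON_3>.
print(restore("<PERSON_2> replied to <PERSON_1>.", mapping))  # Bob replied to Alice.
```

Because both mentions of "Alice" share `<PERSON_1>`, a downstream response that refers to the placeholders can be mapped back unambiguously, as long as the mapping never leaves the local machine.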
  • I_am_tiberius
    I assume they use this model to be able to train new models with user data.
  • flashdesk
    This is exactly where stochastic approaches feel uncomfortable. For anything touching security or privacy, even small inconsistencies can quickly erode trust.
  • flashdesk
    This is where stochastic approaches start to feel a bit uncomfortable. Even small mistakes can make something dealing with sensitive data hard to trust. It seems useful as a first pass, but I’d probably still want some deterministic checks or a human in the loop to feel confident using it.
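A deterministic check of the kind mentioned above could be as simple as re-scanning the model's output with regexes and flagging anything that still matches. This is a minimal sketch: the two patterns are illustrative only (real PII regexes need far more care), and `leaked` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; not a complete or production-grade set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def leaked(redacted_text):
    """Return (label, match) pairs still present after a model redaction pass."""
    return [(label, m.group(0))
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(redacted_text)]

# Pretend the model caught the name but missed the email and the SSN.
after_model = "Contact <PERSON_1> at jane@example.com, SSN 123-45-6789."
print(leaked(after_model))
# [('EMAIL', 'jane@example.com'), ('US_SSN', '123-45-6789')]
```

An empty result only means none of the listed patterns matched; free-form PII such as names or addresses would still need the model or a human reviewer.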
  • ares623
    This actually looks useful. But can someone help me understand how you address the non-perfect scores: "Privacy Filter achieves an F1 score of 96% (94.04% precision and 98.04% recall)." How would you actually use this if it can fail to redact 4% of the data? How do you reliably know which 4% failed?
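A quick arithmetic check on the quoted figures may help here. F1 is the harmonic mean of precision and recall, so a 96% F1 does not by itself mean 4% of the data goes unredacted; the miss rate is governed by recall. The 10,000-span corpus below is an assumption chosen purely for illustration.

```python
precision = 0.9404
recall = 0.9804

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9600

true_spans = 10_000                             # illustrative corpus size (assumption)
caught = true_spans * recall                    # true PII spans the model flags
missed = true_spans - caught                    # true PII spans left unredacted
false_positives = caught / precision - caught   # flagged spans that were not PII

print(round(missed), round(false_positives))
```

Under these numbers roughly 2% of real PII spans would slip through, while about 6% of flagged spans would be over-redactions. The metrics say nothing about which spans were missed, which is why a second deterministic pass or human review is often layered on top.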
  • ndom91
    Where's the gguf from Unsloth and co?