I download a file using the get function of the Python requests library. To store the file, I'd like to determine the filename the way a web browser would for its 'save' or 'save as...' dialog. Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:
    import re
    # r is the response object returned by requests.get(...)
    d = r.headers['content-disposition']
    fname = re.findall("filename=(.+)", d)
But looking more closely at this topic, it isn't that easy: according to RFC 6266 section 4.3 and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences (the latter are discouraged, though, so handling them isn't a hard requirement for me). Further,
when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".
The value of filename*, though, is somewhat more complicated than that of filename: it carries a charset, an optional language tag and percent-encoded text (the RFC 5987 encoding). Also, the RFC seems to allow additional whitespace around the =. Thus, for the examples listed in the RFC, I'd want the following results:
Content-Disposition: INLINE; FILENAME= "an example.html"
filename: an example.html
Content-Disposition: attachment; filename*= UTF-8''%e2%82%ac%20rates
filename: € rates
Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates
filename: € rates here, too (not EURO rates, as filename* takes precedence)
I could implement the parsing of the Content-Disposition header I get from requests myself, but if I can avoid that and use an existing, proven implementation instead, I'd prefer to. Is there a Python library that can do this?
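To show where the simple regex from my first snippet falls short, here is a quick check against the RFC examples. The helper name naive_filename is made up for this illustration; the regex itself is unchanged.

```python
import re

def naive_filename(value):
    # The regex from the first snippet, unchanged (hypothetical helper name).
    return re.findall("filename=(.+)", value)

examples = [
    'INLINE; FILENAME= "an example.html"',
    "attachment; filename*= UTF-8''%e2%82%ac%20rates",
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates',
]

for value in examples:
    print(repr(value), '->', naive_filename(value))
```

The first two examples yield an empty list (the match is case-sensitive, and "filename*=" does not contain "filename="). The third yields the quoted value including its quotes, plus everything that follows it on the line.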
What it doesn't have to handle (but if it does, even better) as I can do that myself:
It should, however, report such cases consistently (whether by raising or by returning None or ''), so that my own fallback can kick in.
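To make the requirements concrete, here is a rough sketch of the hand-rolled parsing I'd otherwise end up with. It uses only re and urllib.parse.unquote from the standard library. The function name filename_from_content_disposition is made up for illustration, and the sketch is neither complete nor proven: it assumes the first occurrence of a parameter wins, and that falling back to filename is acceptable when filename* cannot be decoded.

```python
import re
from urllib.parse import unquote

# One parameter: ";" name "=" value, with optional whitespace around "=".
# The value is either a quoted-string or a bare token.
_PARAM = re.compile(
    r';\s*(?P<name>[^=;\s]+)\s*=\s*'
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<token>[^;\s]+))'
)

def filename_from_content_disposition(value):
    """Return the filename from a Content-Disposition value, or None."""
    params = {}
    for m in _PARAM.finditer(value):
        name = m.group('name').lower()  # parameter names are case-insensitive
        if m.group('quoted') is not None:
            # Undo backslash escapes inside the quoted-string.
            val = re.sub(r'\\(.)', r'\1', m.group('quoted'))
        else:
            val = m.group('token')
        params.setdefault(name, val)  # assumption: first occurrence wins

    ext = params.get('filename*')
    if ext is not None:
        # ext-value layout: charset ' language ' percent-encoded text
        parts = ext.split("'", 2)
        if len(parts) == 3:
            charset, _language, encoded = parts
            try:
                return unquote(encoded, encoding=charset, errors='strict')
            except (LookupError, UnicodeDecodeError):
                pass  # unknown charset or bad bytes: fall back to filename

    return params.get('filename')
```

For the RFC examples above, this returns "an example.html" for the first and "€ rates" for the other two. When no filename information is present it returns None, which is the kind of consistent signal my own fallback could hook into.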