Wireshark-dev: [Wireshark-dev] Dealing with wrong Content-Types in HTTP

From: Nicolás Alvarez <nicolas.alvarez@xxxxxxxxx>
Date: Mon, 31 May 2021 00:58:21 -0300

Hello developers,

While looking at traffic from Apple devices, I'm seeing lots of
non-browser HTTP(S) requests and responses that have incorrect
Content-Type headers. For example, when an iPhone talks to Apple
CoreLocation servers, it sends HTTPS requests with Content-Type:
application/x-www-form-urlencoded and a custom binary format in the
body (I think it's protobuf-based, but I haven't decoded it yet).
Obviously the "form data" dissector then fails to do anything useful
with it.

As another example,
https://configuration.apple.com/configurations/pep/config/geo/networkDefaults-ios-12.2.plist
claims to be application/x-troff-man (?!) despite being XML Plist,
probably because their misconfigured web server looks at the .2.*
extension and thinks it's a manpage.

Finally, many requests and responses use text/xml or application/xml
and the content really is XML, but it would be better to use a
specific dissector for the specific XML-based format being used. I'm
making a dissector for Apple XML Plists and having trouble getting
Wireshark to use it instead of the generic XML dissector. This is also
an issue with other generic formats like JSON.


Two things in Wireshark would help with these incorrect headers.
First, the *user* should be able to override the MIME type so that
Wireshark runs a different dissector on an HTTP request or response
body.

Supposedly this is already possible in HTTP2, since you can select
"HTTP2 content type in stream" (http2.streamid) in the "Decode As"
dialog. However, I never got this to actually work in practice, and I
don't understand how it's supposed to work: an http2.streamid isn't
globally unique, it only makes sense in the context of a specific
tcp.stream. Why would you want stream ID 3 of *all HTTP2 connections*
to have a different content type? And does the override apply to the
request, the response, or both?
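
To make the uniqueness problem concrete, here is a toy model in Python (not Wireshark code). The record layout, the helper name effective_type, and the application/x-protobuf override are all made up for illustration; the only point is the difference between keying an override on the HTTP2 stream ID alone and keying it on the (tcp.stream, http2.streamid) pair.

```python
# Toy model of a "Decode As" override table for HTTP2 bodies.
# All names and values here are illustrative assumptions, not Wireshark internals.

# Two unrelated TCP connections that both happen to use HTTP2 stream 3.
bodies = [
    {"tcp_stream": 0, "h2_stream": 3,
     "content_type": "application/x-www-form-urlencoded"},
    {"tcp_stream": 1, "h2_stream": 3,
     "content_type": "text/html"},
]

def effective_type(body, overrides, key_fn):
    """Return the overridden media type if the key matches, else the declared one."""
    return overrides.get(key_fn(body), body["content_type"])

# Override keyed on the stream ID alone: hits every connection's stream 3.
by_stream_id = {3: "application/x-protobuf"}
coarse = [effective_type(b, by_stream_id, lambda b: b["h2_stream"])
          for b in bodies]

# Override keyed on (tcp.stream, http2.streamid): hits only the intended body.
by_connection = {(0, 3): "application/x-protobuf"}
precise = [effective_type(b, by_connection,
                          lambda b: (b["tcp_stream"], b["h2_stream"]))
           for b in bodies]
```

With the stream-ID-only key, the unrelated text/html body on the second connection is overridden too; with the pair as key it is left alone.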

For HTTP1 there is no equivalent, and I'm not sure how I would add one.
Perhaps using the URL in Decode As would be good enough for a start?
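
A URL-keyed override could look roughly like the following Python sketch. This only models the lookup; it is not how Decode As is implemented, and url_key, pick_media_type, the table contents and the application/x-protobuf value are hypothetical names chosen for the example.

```python
from urllib.parse import urlsplit

def url_key(url):
    """Reduce a URL to a (host, path) pair, ignoring scheme and query string."""
    # Accept both "https://host/path" and bare "host/path" forms.
    parts = urlsplit(url if "//" in url else "//" + url)
    return (parts.hostname or "", parts.path)

# Hypothetical user-supplied overrides: (host, path) -> media type to decode as.
user_overrides = {
    ("gs-loc.apple.com", "/clls/wloc"): "application/x-protobuf",
}

def pick_media_type(request_url, declared_type, overrides):
    """Prefer the user's override for this URL, else trust the Content-Type header."""
    return overrides.get(url_key(request_url), declared_type)
```

For example, pick_media_type("https://gs-loc.apple.com/clls/wloc?x=1", "application/x-www-form-urlencoded", user_overrides) selects the override, while any other URL keeps its declared type. Whether exact-match on host and path is the right granularity (versus prefixes or wildcards) is an open design question.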


Secondly, I think dissectors should be able to override the MIME type as well.

In some cases a heuristic dissector can guess the format from the
contents, but currently the HTTP dissector only calls heuristic
dissectors for the body if nothing else worked. For example, if there
was a heuristic dissector for CoreLocation responses, the HTTP
dissector wouldn't even try it, because it already found a registered
dissector for MIME type application/x-www-form-urlencoded and used
that. Do we need a "try heuristic sub-dissectors first" preference for
HTTP, like TCP has?
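
The ordering problem can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not the actual HTTP dissector logic; dissect_body, the table contents, and the two-byte "magic" check in looks_like_coreloc are invented for the example.

```python
def dissect_body(body, content_type, media_type_table, heuristics,
                 try_heuristics_first=False):
    """Return the name of the dissector that would handle this body."""
    def run_heuristics():
        # Heuristics are (name, predicate) pairs; the first one that accepts wins.
        for name, accepts in heuristics:
            if accepts(body):
                return name
        return None

    if try_heuristics_first:
        found = run_heuristics()
        if found is not None:
            return found
    # Registered dissector for the declared Content-Type, if any.
    if content_type in media_type_table:
        return media_type_table[content_type]
    if not try_heuristics_first:
        # Default order: heuristics are only a fallback.
        found = run_heuristics()
        if found is not None:
            return found
    return "data"

# Hypothetical registrations.
media_type_table = {"application/x-www-form-urlencoded": "form-data"}

def looks_like_coreloc(body):
    # Made-up magic bytes, purely a placeholder for a real content check.
    return body.startswith(b"\x00\x01")

heuristics = [("coreloc", looks_like_coreloc)]

body = b"\x00\x01binary-payload"
default_choice = dissect_body(body, "application/x-www-form-urlencoded",
                              media_type_table, heuristics)
preferred_choice = dissect_body(body, "application/x-www-form-urlencoded",
                                media_type_table, heuristics,
                                try_heuristics_first=True)
```

In the default order the mislabelled body goes to the form-data dissector; with the flag set, the heuristic gets the first chance and claims it.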

Other cases are worse, because it's not easy to detect the format from
the contents or it would be expensive or error-prone to try those
heuristics on every single HTTP body. However, it may be a proprietary
format/protocol used with a specific server, in which case a
sub-dissector could make decisions based on the URL or other headers.
For example, the hypothetical dissector for Apple's Proprietary
CoreLocation Format could accept a packet if the URL is
gs-loc.apple.com/clls/wloc, rather than looking at the actual bytes.
Currently there seems to be no infrastructure to do this (and a
dissector table with hostnames wouldn't be enough). Media type
dissectors don't seem to even have access to the URL. Does it seem
like a good idea to add that? Maybe HTTP can put that information in
proto_data for heuristic dissectors to look at?
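
As a rough illustration of what such a sub-dissector would need, here is a Python model in which the HTTP layer hands a small context record to the heuristic. In Wireshark itself this would presumably go through per-packet data (p_add_proto_data / p_get_proto_data) or the data argument passed to sub-dissectors; HttpContext, its fields, and coreloc_heuristic are hypothetical names for this sketch only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HttpContext:
    """Request information the HTTP layer would expose to sub-dissectors (assumed)."""
    host: str
    path: str
    content_type: str

def coreloc_heuristic(body: bytes, ctx: Optional[HttpContext]) -> bool:
    """Accept the body based on where it was sent, not on what its bytes look like."""
    if ctx is None:
        # No HTTP context available (e.g. called from another parent protocol).
        return False
    return ctx.host.lower() == "gs-loc.apple.com" and ctx.path == "/clls/wloc"

ctx = HttpContext(host="gs-loc.apple.com", path="/clls/wloc",
                  content_type="application/x-www-form-urlencoded")
accepted = coreloc_heuristic(b"\x00\x01opaque", ctx)
rejected = coreloc_heuristic(b"\x00\x01opaque",
                             HttpContext("example.com", "/clls/wloc", "text/plain"))
```

The check is cheap (two string comparisons) and never inspects the payload, which is the property wanted for formats that are hard or expensive to recognise from content alone.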

Thoughts welcome :)

-- 
Nicolás