Ethereal-dev: [Ethereal-dev] ethereal 0.10.3 hangs on Redhat Linux 9 (glibc 2.3.2)

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: craig <craig@xxxxxxxxxxx>
Date: Tue, 17 Aug 2004 16:32:04 -0700
hi folks,

i've run into a problem with several releases of ethereal where it will
hang while decoding packets during live capture or in a capture file.

when this occurs strace shows that ethereal is blocked in the futex()
system call.

it turns out that the problem is caused by either the glibc implementation
of gethostbyaddr() or ethereal's use of signals and longjmp(3) when
calling it (depending on your perspective).

i fixed the problem with the attached patch to epan/resolv.c, which
ensures that AVOID_DNS_TIMEOUT is never defined.  (the comments at the
front of epan/resolv.c make it clear that the author considers the
code associated with AVOID_DNS_TIMEOUT somewhat dubious at best.) you'll
likely want to do something more elegant for your release.

note that gethostbyaddr() seems to time out after 5 sec all by itself,
so the AVOID_DNS_TIMEOUT code isn't really required.

the problem is that glibc has added calls to pthread_mutex_lock()
to gethostbyaddr(), perhaps in an attempt to make it thread safe.
this disassembly of gethostbyaddr() from /usr/lib/libc.a makes that
clear:

00000000 <gethostbyaddr>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   57                      push   %edi
   4:   56                      push   %esi
   5:   53                      push   %ebx
   6:   b8 00 00 00 00          mov    $0x0,%eax
                        7: R_386_32     __pthread_mutex_lock
   b:   83 ec 0c                sub    $0xc,%esp
   e:   85 c0                   test   %eax,%eax
  10:   c7 45 f0 00 00 00 00    movl   $0x0,0xfffffff0(%ebp)
  17:   74 10                   je     29 <gethostbyaddr+0x29>
  19:   83 ec 0c                sub    $0xc,%esp
  1c:   68 18 00 00 00          push   $0x18
                        1d: R_386_32    .bss
  21:   e8 fc ff ff ff          call   22 <gethostbyaddr+0x22>
                        22: R_386_PC32  __pthread_mutex_lock
  ...

meanwhile, host_name_lookup() in epan/resolv.c includes code to set
an alarm and longjmp() out of gethostbyaddr() if the alarm goes off:

	static gchar *host_name_lookup(guint addr, gboolean *found)
	{
	  int hash_idx;
	  hashname_t * volatile tp;
	  struct hostent *hostp;
	  [... deletia ...]

	  if (addr != 0 && (g_resolv_flags & RESOLV_NETWORK)) {
	  /* Use async DNS if possible, else fall back to timeouts,
	   * else call gethostbyaddr and hope for the best
	   */

	# ifdef AVOID_DNS_TIMEOUT

	    /* Quick hack to avoid DNS/YP timeout */

	    if (!setjmp(hostname_env)) {
	      signal(SIGALRM, abort_network_query);
	      alarm(DNS_TIMEOUT);
	# endif /* AVOID_DNS_TIMEOUT */

	      hostp = gethostbyaddr((char *)&addr, 4, AF_INET);

	# ifdef AVOID_DNS_TIMEOUT
	      alarm(0);
	# endif /* AVOID_DNS_TIMEOUT */

so, if a call to gethostbyaddr() takes more than 2 sec, the signal fires
and we longjmp() out of gethostbyaddr() without releasing the mutex.
then, when ethereal calls gethostbyaddr() again, it deadlocks on the
mutex that was never released.

this bug would seem to exist for anyone using recent versions of glibc.

cheers,

craig.

P.S.  occasionally my mozilla hangs as well, and strace shows it blocked
    in futex().  i don't know if it suffers from the same problem or not.

P.P.S.  here's part of an strace of ethereal running with my fix which
    shows the lookup timing out all by itself.

	connect(8, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.137.9.163")}, 28) = 0
	send(8, "X\375\1\0\0\1\0\0\0\0\0\0\003250\00227\003145\003218\7"..., 45, 0) = 45
	gettimeofday({1092785388, 119730}, NULL) = 0
	poll([{fd=8, events=POLLIN}], 1, 5000)  = 0
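
the poll() line in the trace is the resolver bounding its own wait: a
5000 ms timeout that returns 0 when no reply arrives.  the same pattern
can be sketched on its own, with no signals or longjmp() involved.  this
is only an illustration, not resolver code: wait_readable() and
poll_timeout_demo() are hypothetical names, and a pipe with nothing
written to it stands in for a DNS socket that never gets an answer.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* returns >0 if fd has an event pending, 0 on timeout, -1 on error */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
}

/* waits on an idle pipe; returns whatever wait_readable() reported */
int poll_timeout_demo(void)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    /* nothing is ever written, so this should time out after 100 ms */
    int result = wait_readable(fds[0], 100);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

because the wait ends by returning to its caller, any lock taken around
it can be released on the normal return path.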


-- 
{apple,amdahl}!veritas!craig				      craig@xxxxxxxxxxx
(415) 668-3564 (h)					      (650) 527-8520 (w)
--- /usr/src/redhat/BUILD/ethereal-0.10.3/epan/resolv.c	2004-01-25 07:46:31.000000000 -0800
+++ /local/src/cmd/ethereal/ethereal-0.10.3/epan/resolv.c	2004-07-20 12:48:24.000000000 -0700
@@ -46,7 +46,7 @@
  * code in tcpdump, to avoid those sorts of problems, and that was
  * picked up by tcpdump.org tcpdump.
  */
-#if !defined(WIN32) && !defined(__APPLE__)
+#if !defined(WIN32) && !defined(__APPLE__) && 0
 #ifndef AVOID_DNS_TIMEOUT
 #define AVOID_DNS_TIMEOUT
 #endif