After fixing my previous bug, I thought that I would finally be able to fully unleash the beast, my distributed web crawler. But it took only a few more hours of testing and crawling before I encountered another issue. Apparently, some child process spawned by my crawler - which was responsible for the actual crawling - was hung. Completely hung. But this time, I felt that the cause was different from the previous bug. Was it?
In this second and final post of the series, you will discover how I used the various commands of python-debuginfo to track down a nasty bug in my production Python code[1].
The symptoms
So after a few hours of crawling, as a routine check, I thought about running the top command to validate that everything was running smoothly.
[acme@static-crawler-1 logs]$ top
top - 23:40:50 up 5:51, 1 user, load average: 0.64, 1.03, 1.28
Tasks: 104 total, 3 running, 101 sleeping, 0 stopped, 0 zombie
%Cpu(s): 17.3 us, 2.1 sy, 0.0 ni, 76.3 id, 0.0 wa, 0.0 hi, 0.4 si, 3.9 st
KiB Mem : 500840 total, 6424 free, 401004 used, 93412 buff/cache
KiB Swap: 511996 total, 442104 free, 69892 used. 47164 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
...
2605 acme 20 0 468292 60864 3808 S 0.0 12.2 4:09.62 python
2606 acme 20 0 468420 58536 3624 S 0.0 11.7 5:09.23 python
2607 acme 20 0 462612 46848 3836 S 0.0 9.4 3:47.49 python
2609 acme 20 0 466900 27500 532 S 0.0 5.5 1:42.92 python
At first sight, everything looked good. The CPU and memory usage of the child processes were both under control. But wait... What about the CPU time used so far (the TIME+ column)? One of the child processes had accumulated only 1 minute and 42 seconds of CPU time, while the others had accumulated considerably more. "Very weird!", I thought. Remembering this timeless advice given by some mystery individual[2]:
Don't ignore subtle but weird details while debugging. They might matter more than you think.
I decided to look at log files generated by the crawler and its child processes:
[acme@static-crawler-1 logs]$ ls -l
total 41716
-rw-r--r-- 1 acme acme 2393905 Sep 22 23:38 static_crawler.log2597
-rw-r--r-- 1 acme acme 5474742 Sep 22 23:38 static_crawler.log2602
-rw-r--r-- 1 acme acme 5286695 Sep 22 23:37 static_crawler.log2603
-rw-r--r-- 1 acme acme 4967631 Sep 22 23:35 static_crawler.log2604
-rw-r--r-- 1 acme acme 5305057 Sep 22 23:37 static_crawler.log2605
-rw-r--r-- 1 acme acme 5953611 Sep 22 23:37 static_crawler.log2606
-rw-r--r-- 1 acme acme 4813276 Sep 22 23:38 static_crawler.log2607
-rw-r--r-- 1 acme acme 6028218 Sep 22 23:38 static_crawler.log2608
-rw-r--r-- 1 acme acme 2438108 Sep 22 21:19 static_crawler.log2609
Huh? The process with PID 2609 had stopped updating its log file. Damn it! Its timestamp (modification time) was still 21:19, while the others were up-to-date. Why?
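(As an aside, this kind of staleness check is easy to automate. Here's a minimal sketch - purely illustrative, not something I was actually running at the time - that flags any crawler log that hasn't been written to in a while, assuming it's run from the logs directory:)
import glob
import os
import time

STALE_AFTER = 15 * 60  # seconds; tune this to your crawl rate

for log_file in sorted(glob.glob("static_crawler.log*")):
    # Compare the file's modification time against the current time.
    age = time.time() - os.path.getmtime(log_file)
    if age > STALE_AFTER:
        print("%s looks stale (last modified %d seconds ago)" % (log_file, age))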
So I opened the log file using vim and tried to make sense of what had happened:
[acme@static-crawler-1 logs]$ vim static_crawler.log2609
...
2016-09-22 21:19:01,850 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://raghts.com'.
2016-09-22 21:19:01,954 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): raghts.com
And that was it! Obviously, an HTTP connection was stuck on http://raghts.com/. But where exactly in the code had that connection request originated? That was the question I needed to answer.
Tracking down the bug with python-debuginfo
Now that I was familiar with gdb and python-debuginfo, I decided to give them yet another shot.
So I began by activating my Python virtual environment[3]:
[acme@static-crawler-1 logs]$ cd ~/crawler
[acme@static-crawler-1 crawler]$ source env/bin/activate
And then I attached to the hung process:
(env) [acme@static-crawler-1 ~]$ gdb python 2609
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/acme/crawler/env/bin/python3.3...done.
Attaching to program: /home/acme/crawler/env/bin/python, process 2609
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
...
Alright, so now that I was attached, all I needed to do was to import the python-debuginfo extension shipped with my Python version (3.3.6).
(gdb) python
>import sys
>sys.path.insert(0, "/home/acme/Python-3.3.6/Tools/gdb")
>import libpython
>end
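Incidentally, if you find yourself retyping those lines at every gdb session, the same snippet can go into your ~/.gdbinit so the extension is loaded automatically. This is just a convenience - adjust the path to wherever your Python source tree lives:
# In ~/.gdbinit: load the CPython gdb extension at startup.
python
import sys
sys.path.insert(0, "/home/acme/Python-3.3.6/Tools/gdb")
import libpython
end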
The next obvious step was to list all Python stack traces to see where the code was stuck:
(gdb) thread apply all py-bt
Thread 3 (Thread 0x7f62adc90700 (LWP 2635)):
Traceback (most recent call first):
File "/home/acme/crawler/env/lib/python3.3/site-packages/pymongo/periodic_executor.py", line 103, in _run
time.sleep(self._min_interval)
File "/usr/local/lib/python3.3/threading.py", line 858, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.3/threading.py", line 901, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.3/threading.py", line 878, in _bootstrap
self._bootstrap_inner()
Thread 2 (Thread 0x7f62ad48f700 (LWP 2637)):
Traceback (most recent call first):
File "/home/acme/crawler/env/lib/python3.3/site-packages/pymongo/periodic_executor.py", line 103, in _run
time.sleep(self._min_interval)
File "/usr/local/lib/python3.3/threading.py", line 858, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.3/threading.py", line 901, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.3/threading.py", line 878, in _bootstrap
self._bootstrap_inner()
Thread 1 (Thread 0x7f62ae511700 (LWP 2609)):
Traceback (most recent call first):
File "/usr/local/lib/python3.3/socket.py", line 297, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.3/http/client.py", line 320, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.3/http/client.py", line 358, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.3/http/client.py", line 1147, in getresponse
response.begin()
...
File "/home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py", line 81, in fetch
req = self.session.get(robots_url, *args, **kwargs)
File "/home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py", line 69, in cache
fetched = self.fetch(url, *args, **kwargs)
File "/home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py", line 64, in find
return self.cache(url, *self.args, **self.kwargs)
File "/home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py", line 111, in allowed
return self.find(url, fetch_if_missing=True).allowed(
File "/home/acme/crawler/web/throttling/url_action_manager.py", line 74, in get_action
is_allowed = self._robots_cache.allowed(url, self._user_agent)
...
Fortunately, this time it was not some obscure hang in a C module. The problem was right there in thread 1, which was stuck on an HTTP GET request. That request had originated from a class that I use to handle robots.txt files while crawling (more precisely, reppy's RobotsCache). But I had battle-tested that portion of my code, so why would it hang?
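For context, here is roughly how the crawler talks to reppy - a sketch reconstructed from the stack trace above and from my own code further down, not the exact production code:
from reppy.cache import RobotsCache

# The constructor kwargs are stored by RobotsCache and later forwarded to
# requests' session.get() whenever a robots.txt file has to be downloaded.
robots_cache = RobotsCache(verify=False,
                           allow_redirects=True,
                           headers={"User-Agent": "PISCES-BOT (+http://www.astrology.com/)"})

# allowed() checks the cached rules for the URL's domain; on a cache miss, it
# first fetches and parses robots.txt (the fetch() call visible in thread 1),
# then applies the rules for the given user agent.
is_allowed = robots_cache.allowed("http://raghts.com/main",
                                  "PISCES-BOT (+http://www.astrology.com/)")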
What I really needed was the actual URL that caused the hang; then maybe I would be able to reproduce the issue. By reading the code of libpython.py - the python-debuginfo extension itself - I discovered a very nice command called py-bt-full[4], which in theory would give me a richer stack trace containing all the information I needed. So I tried it.
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f62ae511700 (LWP 2609))]
#0 0x00007f62bf1e85bb in recv () from /lib64/libpthread.so.0
(gdb) py-bt-full
#5 Frame 0x7f62a85ae530, for file /usr/local/lib/python3.3/socket.py, line 297, in readinto (self=<SocketIO(_mode='rb', _sock=<socket at remote 0x7f62a796ddb8>,...
return self._sock.recv_into(b)
#17 Frame 0x7f62a878eee0, for file /usr/local/lib/python3.3/http/client.py, line 320, in _read_status (self=<HTTPResponse(debuglevel=0, chunk_left='UNKNOWN', wi...
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
#20 Frame 0x7f62a81ee1e0, for file /usr/local/lib/python3.3/http/client.py, line 358, in begin (self=<HTTPResponse(debuglevel=0, chunk_left='UNKNOWN', will_clos...
version, status, reason = self._read_status()
#23 Frame 0x7f62a87e5590, for file /usr/local/lib/python3.3/http/client.py, line 1147, in getresponse (self=<HTTPConnection(_tunnel_host=None, _HTTPConnection__...
response.begin()
...
#54 Frame 0x7f62a87430a0, for file /home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py, line 81, in fetch (self=<RobotsCache(args=(), kwargs=...
req = self.session.get(robots_url, *args, **kwargs)
#59 Frame 0x7f62a869bb80, for file /home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py, line 69, in cache (self=<RobotsCache(args=(), kwargs=...
fetched = self.fetch(url, *args, **kwargs)
#64 Frame 0x7f62a822c5b0, for file /home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py, line 64, in find (self=<RobotsCache(args=(), kwargs={...
return self.cache(url, *self.args, **self.kwargs)
#68 Frame 0x7f62a83a2da0, for file /home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py, line 111, in allowed (url='http://raghts.com/main')
return self.find(url, fetch_if_missing=True).allowed(
#72 Frame 0x7f62a8314650, for file /home/acme/crawler/web/throttling/url_action_manager.py, line 74, in get_action (self=<UrlActionManager(_robots...
is_allowed = self._robots_cache.allowed(url, self._user_agent)
...
Awesome! So in frame 68, I could see that the parameter url='http://raghts.com/main' had been passed to the RobotsCache.allowed() function. And based on frames 54 to 64, I could see that the associated robots.txt had not been cached yet, so it was being downloaded. Just to be sure, I wanted to confirm which URL had actually been used to download robots.txt. Since frame 5 was currently selected, I had to go up several frames to get that information.
(gdb) py-up
#17 Frame 0x7f62a878eee0, for file /usr/local/lib/python3.3/http/client.py, line 320, in _read_status (self=<HTTPResponse(debuglevel=0, chunk_left='UNKNOWN', wi...
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
(gdb) py-up
#20 Frame 0x7f62a81ee1e0, for file /usr/local/lib/python3.3/http/client.py, line 358, in begin (self=<HTTPResponse(debuglevel=0, chunk_left='UNKNOWN', will_clos...
version, status, reason = self._read_status()
(gdb) py-up
#23 Frame 0x7f62a87e5590, for file /usr/local/lib/python3.3/http/client.py, line 1147, in getresponse (self=<HTTPConnection(_tunnel_host=None, _HTTPConnection__...
response.begin()
(gdb) py-up
#26 Frame 0x7f62a895f120, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/packages/urllib3/connectionpool.py, line 387, in _make_reque...
httplib_response = conn.getresponse()
(gdb) py-up
#30 Frame 0x7f62a85f5510, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/packages/urllib3/connectionpool.py, line 578, in urlopen (se...
chunked=chunked)
(gdb) py-up
#34 Frame 0x7f62a86a8c60, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/adapters.py, line 403, in send (self=<HTTPAdapter(max_retrie...
timeout=timeout
(gdb) py-up
#39 Frame 0x7f62a8227340, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/sessions.py, line 585, in send (self=<Session(adapters=<Orde...
r = adapter.send(request, **kwargs)
(gdb) py-up
#44 Frame 0x7f62a8855ae0, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/sessions.py, line 475, in request (self=<Session(adapters=<O...
resp = self.send(prep, **send_kwargs)
(gdb) py-up
#49 Frame 0x7f62b6075d38, for file /home/acme/crawler/env/lib/python3.3/site-packages/requests/sessions.py, line 487, in get (self=<Session(adapters=<Order...
return self.request('GET', url, **kwargs)
(gdb) py-up
#54 Frame 0x7f62a87430a0, for file /home/acme/crawler/env/lib/python3.3/site-packages/reppy/cache.py, line 81, in fetch (self=<RobotsCache(args=(), kwargs=...
req = self.session.get(robots_url, *args, **kwargs)
Displaying the code surrounding the stack frame would give me much better context:
(gdb) py-list
91
92 try:
93 # First things first, fetch the thing
94 robots_url = Utility.roboturl(url)
95 logger.debug('Fetching %s' % robots_url)
>96 req = self.session.get(robots_url, *args, **kwargs)
97 ttl = max(self.min_ttl, Utility.get_ttl(req.headers, self.default_ttl))
98 # And now parse the thing and return it
99 return parser.Rules(robots_url, req.status_code, req.content,
100 time.time() + ttl,
101 disallow_forbidden=self.disallow_forbidden)
And then I could simply display local variables:
(gdb) py-locals
self = <RobotsCache(args=(), kwargs={'allow_redirects': True, 'headers': {'User-Agent': 'PISCES-BOT (+http://www.astrology.com/)'}, 'verify': False}, disallow_forbidde...
url = 'http://raghts.com/main'
args = ()
kwargs = {'allow_redirects': True, 'headers': {'User-Agent': 'PISCES-BOT (+http://www.astrology.com/)'}, 'verify': False}
robots_url = 'http://raghts.com/robots.txt'
Perfect! So the URL that had been used to download robots.txt was, as expected, http://raghts.com/robots.txt. And I had a GET request stuck on that URL. Was my code to blame? A simple test with curl would tell me whether the URL itself was the problem:
[acme@static-crawler-1 logs]$ curl -X GET http://raghts.com/robots.txt --max-time 30 --verbose
* About to connect() to raghts.com port 80 (#0)
* Trying 192.0.2.0...
* Connected to raghts.com (192.0.2.0) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: curl/7.29.0
> Host: raghts.com
> Accept: */*
>
* Operation timed out after 30006 milliseconds with 0 out of -1 bytes received
* Closing connection 0
curl: (28) Operation timed out after 30006 milliseconds with 0 out of -1 bytes received
The result was very interesting: curl was able to connect to the URL, but it never received any data, so it eventually timed out. The problem was clearly related to timeouts. Digging a bit further, I realized that curl's behavior was easy to explain. HTTP client libraries (and hence requests, the library used under the hood by RobotsCache) typically distinguish between two types of timeouts:
- Connect timeout: maximum number of seconds to wait for the socket to connect.
- Read timeout: maximum number of seconds to wait for the socket to receive data. Any silence longer than that - whether before the first byte arrives or in the middle of the download - leads to a timeout.
So clearly, I had been hit by the latter type of timeout.
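To make this concrete, here is how those two timeouts are typically expressed with requests - a generic sketch, not code from my crawler:
import requests

# requests accepts either a single number or a (connect, read) tuple.
# With no timeout at all, a silent server can keep the call blocked forever -
# exactly what happened to my crawler.
try:
    response = requests.get("http://example.com/robots.txt", timeout=(3.05, 27))
except requests.exceptions.ConnectTimeout:
    print("Could not establish the TCP connection in time.")
except requests.exceptions.ReadTimeout:
    print("Connected, but the server stopped sending data.")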
The solution
So I peeked into my code and well... I discovered what every shady psychic would have been able to predict[5].
...
return RobotsCache(verify=False,
allow_redirects=True,
headers={
"User-Agent": user_agent
})
I was creating an instance of RobotsCache without setting any timeout! Bummer! This totally explained why it was stuck forever on the problematic URL. Fortunately, the fix was really simple:
...
return RobotsCache(verify=False,
allow_redirects=True,
headers={
"User-Agent": user_agent
},
timeout=timeout)
(resounding trumpet solo) Problem fixed!
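One note on that timeout argument: as the stack trace showed, RobotsCache simply forwards its keyword arguments to session.get(), so the value can also be a (connect, read) tuple if you want to control both timeouts independently. Here is a sketch of what the corrected factory roughly looks like - the timeout values are illustrative, not the ones I use in production:
from reppy.cache import RobotsCache

def build_robots_cache(user_agent, timeout=(3.05, 30)):
    # Every kwarg here ends up in requests' session.get(), so the timeout
    # applies to each robots.txt download performed by the cache.
    return RobotsCache(verify=False,
                       allow_redirects=True,
                       headers={"User-Agent": user_agent},
                       timeout=timeout)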
Bonus section: a python-debuginfo cheatsheet!
Before I conclude, I thought that it would be handy to provide you with a comprehensive list of python-debuginfo commands. So here we go.
| Command | Purpose |
|---|---|
| py-list | Prints the Python source code for the current stack frame in the currently selected thread. |
| py-bt | Prints a basic stack trace (backtrace) for the currently selected thread. |
| py-bt-full | Prints a richer stack trace (backtrace) for the currently selected thread, including frame numbers and the parameter values passed to functions. |
| py-up | Moves up one frame in the call stack of the currently selected thread. |
| py-down | Moves down one frame in the call stack of the currently selected thread. |
| py-print | Prints the value of the specified variable name (e.g. py-print {the_variable_name}). Local, global and builtin variables are supported, but not class members such as self.the_variable_name. |
| py-locals | Prints the value of all local variables for the current stack frame in the currently selected thread. |
Final thoughts
As we've seen in this post, gdb and python-debuginfo are your best friends when it comes to tracking down bugs and hangs in production Python code. More specifically:
- gdb allows you to attach to an already running Python process.
- python-debuginfo offers you a comprehensive set of commands within gdb that let you walk any Python callstack. Those commands allow you to inspect local variables and parameters passed to functions.
Without those tools, it would have been much harder to pinpoint the exact cause and location of my bug. But now that I'm more familiar with them, I feel... surprisingly... quite excited to tackle new bugs! This definitely reminds me of the "AHA!" moment I had back in the good old days of .NET, when I finally understood how WinDbg and SOS work. It's the same feeling, really!
But what about you? Have you ever used gdb and python-debuginfo? And was my post helpful to you? If so, please share your comments in the comment section below!
Update (14/10/2016): this post was featured in the Awesome Python, Python Weekly, Pycoders Weekly and ImportPython newsletters. If you get a chance to subscribe to any one of them, you won't be disappointed! Thank you all!
[1] For privacy purposes, I deliberately obfuscated all URLs, IP addresses, user agents and domain names in this post. No psychic was hurt while I was writing this post.
[2] Disclaimer: I may occasionally produce great quotes. I don't know if this one will be timeless, but so far it's been quite useful. Taken from my previous blog post.
[3] If you use a specific version of Python (e.g. Python 3.3.6), activating your virtual environment is mandatory. This is actually how gdb will load the symbols of your specific Python version, rather than those of the Python version shipped by default with your OS (e.g. Python 2.7 in CentOS 7.2).
[4] Unfortunately, at the time of writing this post, the py-bt-full command was not really documented, even on https://fedoraproject.org/wiki/Features/EasierPythonDebugging. Its documentation in libpython.py is not helpful either, as it's a copy-paste of py-bt's documentation string.
[5] Actually, I doubt that any psychic would have been able to help me at all :) However, it was rather predictable that my code wasn't handling timeouts properly, as I had already encountered a similar problem previously. The bottom line is that almost every bug looks simple (and even predictable) in retrospect, no matter how hard it was to figure out.