How to Build lxml and Get Its Debug Symbols on Windows

If you're one of those lost souls trying to debug lxml on Windows, then you're most likely having some trouble.

One of my Python applications was recently plagued by a memory leak, and I desperately needed lxml's debug symbols because umdh simply refused to display lxml's function names in its logs. I was thus unable to see exactly where the problem was.

I finally figured out how to get lxml's debug symbols, but it turned out to be quite a bit harder than I thought. So read on to discover my findings.

What's the problem, really?

If you use a 32-bit version of Python, then you're lucky. To compile lxml and get a wheel file, all you need to do is run this command [1]:

python setup.py bdist_wheel --static-deps  

This will essentially download all of lxml's dependencies (as static libraries), compile lxml and then link it statically against them. Finally, this will bundle everything in a nice and cute wheel file, ready to be installed.

But there's a catch! You will not get lxml's debug symbols. So if you ever need to debug lxml or one of its dependencies, you'll end up in the same boat as me!
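
Since the "right version of Visual Studio" depends on the compiler your Python was built with, it can help to check that first. Here is a minimal sketch, assuming your interpreter was built with MSVC and reports it in sys.version the usual way (something like "[MSC v.1600 64 bit (AMD64)]"). The function name and the small lookup table are mine, not part of lxml or distutils:

```python
import re
import sys

# A few well-known MSC version numbers and the Visual C++ release they map to.
MSC_TO_VISUAL_CPP = {
    1500: "Visual C++ 2008",
    1600: "Visual C++ 2010",
    1900: "Visual C++ 2015",
}


def msc_version(version_string):
    """Return the MSC version number found in a sys.version string, or None."""
    match = re.search(r"MSC v\.(\d+)", version_string)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    found = msc_version(sys.version)
    print(found, MSC_TO_VISUAL_CPP.get(found, "unknown or non-MSVC build"))
```

If the number doesn't match a compiler installed on your machine, expect distutils to complain before it compiles anything.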

Now, if you use a 64-bit version of Python, then apparently you're looking for even more trouble. The build scripts shipped with lxml were actually not made to build a 64-bit version of the library. If you try it, they will build a 64-bit version of lxml, and link to a 32-bit version of its dependencies, resulting in the following linker errors:

...
lxml.etree.obj : error LNK2001: unresolved external symbol __imp_xsltDocDefaultLoader  
lxml.etree.obj : error LNK2001: unresolved external symbol __imp_xsltLibxsltVersion  
lxml.etree.obj : error LNK2019: unresolved external symbol xmlSetExternalEntityLoader referenced in function PyInit_etree  
lxml.etree.obj : error LNK2019: unresolved external symbol xmlGetExternalEntityLoader referenced in function PyInit_etree  
lxml.etree.obj : error LNK2019: unresolved external symbol __xmlParserVersion referenced in function PyInit_etree  
lxml.etree.obj : error LNK2019: unresolved external symbol xmlInitParser referenced in function PyInit_etree  
lxml.etree.obj : error LNK2019: unresolved external symbol xmlThrDefLineNumbersDefaultValue referenced in function PyInit_etree  
lxml.etree.obj : error LNK2019: unresolved external symbol xmlThrDefIndentTreeOutput referenced in function PyInit_etree  
build\lib.win-amd64-3.3\lxml\etree.pyd : fatal error LNK1120: 230 unresolved externals  
error: command '"c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\Bin\amd64\link.exe"' failed with exit status 1120  
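
If you're not sure which of the two cases applies to you, you can ask the interpreter itself. This is only a quick sketch: the helper names are hypothetical, and the "win32" suffix is my assumption (the archives shown in this post only use "win64"):

```python
import struct


def pointer_bits():
    """Size of a C pointer in the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def expected_dist_suffix(bits):
    """Suffix to look for in the dependency archive names.

    'win64' matches the archive names shown in this post; 'win32' is assumed.
    """
    return "win64" if bits == 64 else "win32"


if __name__ == "__main__":
    bits = pointer_bits()
    print("Python is %d-bit, so look for *.%s.zip" % (bits, expected_dist_suffix(bits)))
```

The point is simply that the bitness of Python, of lxml and of every dependency has to agree, otherwise the linker can't resolve the symbols.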

The above is mainly why I told myself: "Screw it! I'll figure out how to make it work myself!"

Okay, so what's the solution?

So I came up with a fairly complete procedure to build lxml and get its debug symbols. The main challenge was to find the right combination of tools and steps to make it work. Here it is for your pleasure:

  1. I installed Visual C++ Build Tools and PowerShell 4.0.
  2. I opened a PowerShell prompt and installed the PowerShell Community Extensions:

    PS C:\Users\acme> Set-ExecutionPolicy remotesigned
    PS C:\Users\acme> Find-Package pscx | ? ProviderName -eq PSModule | Install-Package -Force
    Name                           Version          Source           Summary
    ----                           -------          ------           -------
    Pscx                           3.2.2            https://www.p...
    
  3. I cloned the libxml2-win-binaries repository, which contains all of lxml's dependencies [2].

    PS C:\Users\acme> cd \tmp
    PS C:\tmp> git clone https://github.com/mhils/libxml2-win-binaries.git libxml2-win-binaries
    Cloning into 'libxml2-win-binaries'...
    ...
    PS C:\tmp> cd .\libxml2-win-binaries
    PS C:\tmp\libxml2-win-binaries> git submodule update --init --recursive
    Submodule 'libiconv' (https://github.com/winlibs/libiconv) registered for path 'libiconv'
    Submodule 'libxml2' (https://github.com/winlibs/libxml2) registered for path 'libxml2'
    Submodule 'libxslt' (https://github.com/winlibs/libxslt) registered for path 'libxslt'
    Submodule 'zlib' (https://github.com/winlibs/zlib.git) registered for path 'zlib'
    Cloning into 'libiconv'...
    ...
    Cloning into 'libxml2'...
    ...
    Cloning into 'libxslt'...
    ...
    Cloning into 'zlib'...
    ...
    
  4. I compiled my local copy of the libxml2-win-binaries repository.

    PS C:\tmp\libxml2-win-binaries> .\build.ps1 -x64
    Microsoft (R) Build Engine version 14.0.25420.1
    Copyright (C) Microsoft Corporation. All rights reserved.
    Building the projects in this solution one at a time. To enable parallel build, please add the "/m" switch.
    Build started 2016-11-21 16:08:29.
    Project "C:\tmp\libxml2-win-binaries\libiconv\MSVC14\libiconv.sln" on node 1 (default targets).
    ValidateSolutionConfiguration:
      Building solution configuration "Release|X64".
    ...
    
  5. I opened a standard command prompt and downloaded the latest release of lxml's source code, which contains a pre-generated version of lxml.etree.c:

    C:\tmp>curl -O http://lxml.de/files/lxml-3.6.4.tgz
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 3620k  100 3620k    0     0  71522      0  0:00:51  0:00:51 --:--:-- 88691
    
  6. I extracted the archive:

    c:\tmp>"C:\Program Files\7-Zip\7z.exe" e lxml-3.6.4.tgz
    ...
    C:\tmp>"C:\Program Files\7-Zip\7z.exe" x lxml-3.6.4.tar
    ...
    
  7. I copied all previously built archives to lxml's folder, and extracted them to a libs folder:

    C:\tmp>mkdir lxml-3.6.4\libs
    C:\tmp>copy c:\tmp\libxml2-win-binaries\dist\*.zip c:\tmp\lxml-3.6.4\libs
    c:\tmp\libxml2-win-binaries\dist\iconv-1.14.win64.zip
    c:\tmp\libxml2-win-binaries\dist\libxml2-2.9.4.win64.zip
    c:\tmp\libxml2-win-binaries\dist\libxslt-1.1.29.win64.zip
    c:\tmp\libxml2-win-binaries\dist\zlib-1.2.8.win64.zip
            4 file(s) copied.
    C:\tmp>cd lxml-3.6.4\libs
    C:\tmp\lxml-3.6.4\libs>"C:\Program Files\7-Zip\7z.exe" x *.zip
    ...
    
  8. I edited a few paths defined at the top of the file lxml-3.6.4\setup.py:

    ...
    STATIC_INCLUDE_DIRS = [
        "C:\\tmp\\lxml-3.6.4\\libs\\iconv-1.14.win64\\include",
        "C:\\tmp\\lxml-3.6.4\\libs\\libxml2-2.9.4.win64\\include",
        "C:\\tmp\\lxml-3.6.4\\libs\\libxslt-1.1.29.win64\\include",
        "C:\\tmp\\lxml-3.6.4\\libs\\zlib-1.2.8.win64\\include"
    ]
    STATIC_LIBRARY_DIRS = [
        "C:\\tmp\\lxml-3.6.4\\libs\\iconv-1.14.win64\\lib",
        "C:\\tmp\\lxml-3.6.4\\libs\\libxml2-2.9.4.win64\\lib",
        "C:\\tmp\\lxml-3.6.4\\libs\\libxslt-1.1.29.win64\\lib",
        "C:\\tmp\\lxml-3.6.4\\libs\\zlib-1.2.8.win64\\lib"
    ]
    ...
    
  9. I patched the file C:\Python33\Lib\distutils\msvc9compiler.py to trick distutils into thinking that Visual C++ 2015 should be used to compile lxml [3]:

    1. At the top of the query_vcvarsall() function, I modified the vcvarsall variable:

      #vcvarsall = find_vcvarsall(version)
      vcvarsall = "c:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\vcvarsall.bat"
      
    2. In the MSVCCompiler class, inside the initialize() function, I added the necessary compiler and linker flags (/Z7 and /DEBUG respectively) [4]:

      def initialize(self, plat_name=None):
          ...
          if self.__arch == "x86":
              ...
          else:
              # Win64
              self.compile_options = [ '/nologo', '/Ox', '/MD', '/W3', '/GS-' ,
                                       '/DNDEBUG', '/Z7']
              ...
          self.ldflags_shared = ['/DLL', '/nologo', '/INCREMENTAL:NO', '/DEBUG']
          ...
      
  10. In a command prompt, I compiled lxml and built a wheel file:

    C:\tmp\lxml-3.6.4>python setup.py bdist_wheel --static
    Building lxml version 3.6.4.
    Building without Cython.
    ERROR: b"'xslt-config' not recognized as an internal or external command, operable program or batch file."
    ** make sure the development packages of libxml2 and libxslt are installed **
    Using build configuration of libxslt
    ...
    
  11. Finally, I uninstalled the version of lxml that I had previously installed, and installed the wheel file that I had just compiled:

    (env) G:\crawler>pip uninstall lxml
      ...
      Successfully uninstalled lxml-3.5.0  
    (env) G:\crawler>pip install c:\tmp\lxml-3.6.4\dist\lxml-3.6.4-cp33-none-win_amd64.whl
    Processing c:\tmp\lxml-3.6.4\dist\lxml-3.6.4-cp33-none-win_amd64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.6.4
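
A side note on step 8: if typing those paths by hand gets tedious, the two lists could be generated from the libs folder instead. This is just a sketch under the assumption that every extracted archive has its own folder with include and lib subfolders, as in the listing above. The static_dirs() helper is hypothetical and not part of lxml's setup.py:

```python
import os


def static_dirs(libs_root, subdir):
    """Return the existing '<libs_root>/<package>/<subdir>' folders, sorted by package."""
    paths = []
    for name in sorted(os.listdir(libs_root)):
        candidate = os.path.join(libs_root, name, subdir)
        # Skips the .zip files and any package folder without that subfolder.
        if os.path.isdir(candidate):
            paths.append(candidate)
    return paths


# Possible use at the top of setup.py (paths taken from this post):
# STATIC_INCLUDE_DIRS = static_dirs(r"C:\tmp\lxml-3.6.4\libs", "include")
# STATIC_LIBRARY_DIRS = static_dirs(r"C:\tmp\lxml-3.6.4\libs", "lib")
```

I typed the paths manually myself, so treat this as a convenience rather than a tested part of the procedure.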
    

And now, guess what? If you've successfully completed all these steps, you've got a fully working wheel file including lxml and its debug symbols! Awesome! Function names will now be properly displayed in debugging tools like umdh and WinDbg.
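
If you want to double-check before installing, a wheel is just a zip archive, so you can peek inside it. Here is a minimal sketch, assuming the .pdb files ended up next to the .pyd files in the wheel (that depends on how your build went). The debug_artifacts() name is mine:

```python
import zipfile


def debug_artifacts(wheel_path):
    """Return (extension modules, .pdb files) found inside a wheel archive."""
    with zipfile.ZipFile(wheel_path) as wheel:
        names = wheel.namelist()
    modules = sorted(n for n in names if n.endswith(".pyd"))
    symbols = sorted(n for n in names if n.endswith(".pdb"))
    return modules, symbols


# Example call, using the wheel path from step 11:
# modules, symbols = debug_artifacts(r"c:\tmp\lxml-3.6.4\dist\lxml-3.6.4-cp33-none-win_amd64.whl")
```

An empty second list would suggest that the /DEBUG linker flag never made it into the build.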

Final thoughts

As we've seen in this post, building lxml and getting its debug symbols is not a straightforward endeavor. This is especially true if you're using a 64-bit version of Python on Windows.

For me, an alternative would have been to develop and build everything under Linux. Things would probably have been much easier, as most distributions come with a compiler and a package manager by default. Unfortunately, I didn't have that luxury, as my current development machine is a Windows laptop [5].

But now, what about you? Have you ever built a Python C extension yourself on Windows? What are the problems that you encountered? And did you find this post helpful? If so, please share your comments in the section below!

  1. This assumes that you have the right version of Visual Studio installed on your machine. More on this later on.

  2. Depending on the version of lxml that you need to debug, you may have to check out different changesets for the submodules of the libxml2-win-binaries repository.

  3. This ugly hack is necessary because distutils expects to use the exact same compiler and linker versions that were used to compile the Python distribution itself. For example, Python 3.3.5 for Windows was compiled using Visual C++ 2010, so distutils tries to build your extensions using Visual C++ 2010. If it's not found on your system, the build will fail. Don't forget to make a backup of msvc9compiler.py before attempting this stunt.

  4. For the compiler, /Z7 is necessary because it embeds all debugging information directly in the .obj files, instead of writing it to an intermediate .pdb file. This is important, because lxml actually contains two extension modules (lxml.etree and lxml.objectify), and each of them would otherwise write to the same .pdb file (named vc140.pdb), thus making it unusable by popular debugging tools (WinDbg, umdh, Visual Studio, etc.). For the linker, /DEBUG is necessary because it will generate a .pdb file for each module (etree.pdb and objectify.pdb).

  5. I'm working on that! I plan on installing Ubuntu eventually.
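
A last note on the flags from footnote 4: the patch in step 9 boils down to adding one entry to each of two option lists. The sketch below shows that idea in isolation, on plain Python lists. The with_debug_flags() helper is hypothetical (distutils has nothing like it), and the sample lists are the ones from the snippet in step 9 without my additions:

```python
def with_debug_flags(compile_options, ldflags):
    """Return copies of both option lists with /Z7 and /DEBUG added at most once."""
    compile_out = list(compile_options)
    link_out = list(ldflags)
    if "/Z7" not in compile_out:
        compile_out.append("/Z7")      # debug info embedded in the .obj files
    if "/DEBUG" not in link_out:
        link_out.append("/DEBUG")      # linker writes one .pdb per module
    return compile_out, link_out


if __name__ == "__main__":
    compile_options = ['/nologo', '/Ox', '/MD', '/W3', '/GS-', '/DNDEBUG']
    ldflags_shared = ['/DLL', '/nologo', '/INCREMENTAL:NO']
    print(with_debug_flags(compile_options, ldflags_shared))
```

Returning copies rather than modifying the lists in place keeps the original options around, which is handy if you want to compare a release build with a debug build.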