DER-bot

Hi there! My name is Benoit and I'm a human being. You may have stumbled upon this page by accident... or maybe you've just found out that a crawler named DER-bot visited your website.

Well, this is a little side project of mine and it was really fun to develop! :)

If it has ever caused you any problems, please know that I'm deeply sorry; that was not intended at all. I will make every reasonable effort to cooperate with you so that it no longer causes any trouble.

Please read on for the details. And feel free to contact me with any question or request: hi at benbernardblog.com.

Purpose

  1. DER is an acronym for Developer Experimental Research. The goal of my crawler is to validate information contained in the Stack Exchange Data Dump.
  2. I'm in no way affiliated with Stack Exchange Inc., stackoverflow.com or any other company. I'm a single individual, working on my own.
  3. This is done out of curiosity, as a learning experiment and for personal research only.
  4. I'm deeply convinced that you and the Stack Overflow community could benefit from my experiment. After all, many questions remain unanswered: what is the real proportion of men and women among programmers? In which cities do programmers live? Is there really such a thing as a "talent shortage"? Are there any hidden talent pools in the world? The list of possible questions is endless. Please know that providing useful information back to you and the community is the whole point of my experiment.
  5. I may eventually publish a partial, transformed version of the "raw" crawled data (i.e. a derivative dataset), but nothing more. Those results:
    • will only be represented as charts.
    • will be anonymized in every way possible, so they will NOT contain personally identifiable information (i.e. it will be impossible to trace information back to a specific person or user).
    • will NOT falsely represent data or include any negative/discriminatory remarks about your website, service or business.
    • will include attribution when possible.
    • will be shared under the same license as the Stack Exchange Data Dump.
  6. Be assured that in the specific context of this project, I will NOT:
    • compete with your website, service or business in any way.
    • try to hurt your business, brand(s), reputation, quality, value, image, staff, members or users in any way.
    • collect or harvest non-public, or highly sensitive/secret data about you, your users or your server(s).
    • extract email addresses from your website or service and store them, either for spam or any other purpose.
    • perform a distributed denial-of-service (DDoS) attack against your website or service, or deliberately try to hurt its performance in any way.
    • maliciously try to violate your intellectual property or copyright rights, or those of your users.
    • crawl beyond the strict minimum required by my research; in particular, I will not crawl your entire website or service.
    • rent/lease/loan/trade/sell/re-sell the "raw" crawled data or any derivative dataset, or make any money out of it.
    • use the "raw" crawled data or any derivative dataset for illegal/illicit/questionable activities.
    • publish/republish/post/distribute/share the "raw" crawled data.
    • let any other individual, third party or business entity have access to the "raw" crawled data. Moreover, to make sure that I respect applicable laws, I have put in place several security measures to prevent any leak from occurring (firewall rules, encrypted communications, very strong passwords, encrypted data).
    • use highly sophisticated techniques to circumvent security measures that you have put in place on your website or service.
    • try to sell anything to your staff, or to users of your website or service, if applicable.
    • try to hire any member of your staff.
    • make my code public, or offer guidance on how to crawl your site or service specifically.

Politeness measures

  1. DER-bot limits the crawl depth to at most 1: it fetches a seed page and, at most, the pages that the seed page directly links to. This means that the crawler will visit only a very small subset of your website.
  2. DER-bot respects the directives of your robots.txt, including:
    • Allow/disallow directives.
    • Crawl delay. If you didn't define any Crawl-delay directive, it defaults to 15 seconds, which means that any two consecutive requests to your website/service will be spaced out by at least 15 seconds. This is done to ensure that the performance of your website/service is not affected in any way. However, be aware that during my initial test runs, requests may occasionally be spaced a bit more tightly than that (as I debug things along the way), but this will only be temporary. A sample robots.txt is shown after this list.
  3. For the most part, DER-bot only crawls HTML content (not images, JavaScript, binary files, etc.), so it should consume only minimal bandwidth. However, please be aware that one of the nodes uses the Chromium web browser engine to crawl dynamically generated (JavaScript-heavy) websites, whose full HTML content would otherwise be impossible to retrieve. For this node (see the IP list below), 'DER-bot' appears in the user agent string only when it downloads robots.txt from your website; beyond that, it displays Chromium's user agent string. A quick way to spot DER-bot in your logs is shown after this list.
  4. DER-bot explicitly excludes a list of websites or services whose terms of service clearly forbid unauthorized crawlers. However, please be aware that:
    • the list may be incomplete. I will do my best to keep it up-to-date, and upon request I will gladly add your website or service to it.
    • some URL redirection services (e.g. bit.ly, goo.gl, etc.), other services/mechanisms or even bugs may cause the crawler to inadvertently visit pages from such websites or services. In those cases, be assured that content from those unwanted pages should NOT be crawled, and if it ever is (bugs exist, you know!), it will be destroyed.
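
For illustration, here is a hypothetical robots.txt that DER-bot would honor; the /private/ path and the 30-second value are placeholders, and Crawl-delay is the de facto directive name for the crawl rate discussed above:

    User-agent: DER-bot
    Disallow: /private/
    Crawl-delay: 30

With these lines in place, DER-bot would skip everything under /private/ and wait at least 30 seconds between any two consecutive requests to your site.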
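If you want to check whether DER-bot has visited your site, a simple way (assuming an nginx access log at its usual location; adjust the path for your server) is:

    grep -i "DER-bot" /var/log/nginx/access.log

Keep in mind that the Chromium node identifies itself as 'DER-bot' only when fetching robots.txt, so this will not match that node's regular page requests.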

Excluding your website/service from the crawl

  • Add the following lines to your robots.txt to tell DER-bot to stop crawling your website/service; it should take effect within 1 hour:

      User-agent: DER-bot
      Disallow: /
  • To be excluded right away, you can also block the crawler by IP address using standard firewall rules. The addresses it uses are (replace hyphens with periods):
    • 192-241-132-78
    • 67-205-144-185
    • 67-205-158-186
    • 67-205-158-199
    • 198-199-80-150 (Chromium node)
  • If your website or service is hosted on Linux (e.g. CentOS), you can block these addresses by running the following command once per address (a worked example follows this list):

      sudo iptables -A INPUT -s {ENTER_THE_IP_ADDRESS_HERE}/32 -p tcp -m multiport --dports 80,443 -m conntrack --ctstate NEW,ESTABLISHED -j DROP
  • Finally, you can also contact me by email (hi at benbernardblog.com) and include a URL to your website, so that I can filter it out of the crawl right away. Upon request, I will also destroy any data already crawled from your website or service. This will take effect immediately.
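
For instance, to block the first address listed above (192.241.132.78, with the hyphens replaced by periods) on the HTTP and HTTPS ports:

    sudo iptables -A INPUT -s 192.241.132.78/32 -p tcp -m multiport --dports 80,443 -m conntrack --ctstate NEW,ESTABLISHED -j DROP

Each such rule drops all new and established TCP connections from that address to ports 80 and 443; repeat it for every address in the list.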