Preventing Bots from Being Counted in a Django Web App


When implementing a hit or view counter in a Django web project, we might want to exclude bots and web crawlers. Let's try to do that manually. Consider a model that represents a blog article, in models.py:

from django.db import models
from django.utils import timezone

class Article(models.Model):
    title = models.CharField(max_length=1000)
    date_created = models.DateTimeField(auto_now_add=True)
    date_modified = models.DateTimeField(default=timezone.now, editable=False)
    view_count = models.IntegerField(default=0)
    # and other fields, such as the slug that the view uses for lookups

The view count (view_count) should be increased by one only if:

  • a signed cookie is not present,
  • the visitor is not authenticated (i.e., not the creator), and
  • the visitor is not a bot or crawler.

In views.py:

from django.db.models import F
from django.shortcuts import get_object_or_404, render

from .models import Article


def article(request, year, slug):
    """Render an article."""
    art = get_object_or_404(Article, date_created__year=year, slug=slug)
    response = render(request, 'article.html', {
        'art': art,
    })

    # update article's view count using cookies
    cookie_not_present = request.COOKIES.get(cookie_name) is None
    user_non_admin = not request.user.is_authenticated
    # fall back to an empty string so a missing header never reaches the matcher as None
    user_agent_non_bot = not _is_user_agent_bot(request.META.get('HTTP_USER_AGENT', ''))
    if cookie_not_present and user_non_admin and user_agent_non_bot:
        # remember this visitor for three days (max_age is given in seconds)
        response.set_signed_cookie(cookie_name, cookie_value, max_age=3 * 24 * 60 * 60)
        # F() lets the database do the increment, avoiding a read-modify-write race
        art.view_count = F('view_count') + 1
        art.save(update_fields=['view_count'])

    return response

cookie_name and cookie_value are not defined in this snippet; set both to values of your choice. In this example, the cookie expires after three days.

Now the most important part: the _is_user_agent_bot() definition:

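import re

from django.conf import settings


def _is_user_agent_bot(user_agent):
    """Return True if the User-Agent string matches a name in BOT_LIST."""
    if not user_agent:
        # a missing or empty header cannot match any name in the list
        return False
    # method one: the header is exactly one of the listed names
    if user_agent in settings.BOT_LIST:
        return True
    # method two: a listed name appears anywhere in the header
    user_agent = user_agent.lower()
    for bot_name in settings.BOT_LIST:
        # re.escape() keeps names like 'www.galaxy.com' from acting as regex patterns
        if re.search(re.escape(bot_name.lower()), user_agent):
            return True
    return False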

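Alright, two methods. The first catches a User-Agent that is exactly one of the listed names; the second looks for each name anywhere in the header. I also used .lower() so that the comparison ignores case, although this may not be necessary.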
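To see how the two methods differ, here is a minimal standalone sketch that runs outside Django. SAMPLE_BOTS is a made-up stand-in for settings.BOT_LIST, the helper names matches_exactly() and matches_anywhere() are hypothetical, and the User-Agent strings are only illustrative. Each name is passed through re.escape() so that punctuation in a name is matched literally.

```python
import re

# Hypothetical stand-in for settings.BOT_LIST, so the sketch runs without Django.
SAMPLE_BOTS = ('Googlebot', 'Slurp', 'msnbot')


def matches_exactly(user_agent, bot_names=SAMPLE_BOTS):
    """Method one: the whole header equals one of the listed names."""
    return user_agent in bot_names


def matches_anywhere(user_agent, bot_names=SAMPLE_BOTS):
    """Method two: a listed name occurs somewhere in the header, ignoring case."""
    lowered = user_agent.lower()
    return any(re.search(re.escape(name.lower()), lowered) for name in bot_names)


# Illustrative header values, not taken from real traffic.
crawler_agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
browser_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

print(matches_exactly(crawler_agent))   # False: the header is longer than the bare name
print(matches_anywhere(crawler_agent))  # True: 'googlebot' appears inside the header
print(matches_anywhere(browser_agent))  # False: no listed name appears
```

In practice most crawlers send a longer header like the first one, so method two is the one that does most of the work.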

The BOT_LIST config variable must be added to settings.py:

BOT_LIST = (
    'Teoma', 'alexa', 'froogle', 'Gigabot', 'inktomi', 'looksmart',
    'URL_Spider_SQL', 'Firefly', 'NationalDirectory', 'Ask Jeeves',
    'TECNOSEEK', 'InfoSeek', 'WebFindBot', 'girafabot', 'crawler',
    'www.galaxy.com', 'Googlebot', 'Googlebot/2.1', 'Google', 'Webmaster',
    'Scooter', 'James Bond', 'Slurp', 'msnbot', 'appie', 'FAST', 'WebBug',
    'Spade', 'ZyBorg', 'rabaz', 'Baiduspider', 'Feedfetcher-Google',
    'TechnoratiSnoop', 'Rankivabot', 'Mediapartners-Google',
    'Sogou web spider', 'WebAlta Crawler', 'MJ12bot', 'Yandex/',
    'YaDirectBot', 'StackRambler', 'DotBot',
)

I got the list from here. These are all names of bots, e.g., Google's Googlebot and Yahoo's Slurp.

Of course, this only works if the bot identifies itself with one of the listed strings. The HTTP User-Agent header is easy to change, so a bot that sends a browser's value will still be counted.
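The limitation is easy to reproduce. In this rough sketch, which assumes a shortened sample list and a hypothetical helper looks_like_bot() that uses a plain substring check, a crawler that announces itself is caught, while a request carrying a browser-style User-Agent is not. The header strings are illustrative examples.

```python
# Shortened, hypothetical sample of bot names; the full list lives in settings.BOT_LIST.
SAMPLE_BOTS = ('Googlebot', 'Baiduspider', 'MJ12bot')


def looks_like_bot(user_agent, bot_names=SAMPLE_BOTS):
    """Case-insensitive substring check against the sample names."""
    lowered = (user_agent or '').lower()
    return any(name.lower() in lowered for name in bot_names)


# A crawler that identifies itself, and one that pretends to be a browser.
honest = 'Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)'
spoofed = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'

print(looks_like_bot(honest))   # True
print(looks_like_bot(spoofed))  # False, so this request would still be counted
print(looks_like_bot(None))     # False: a missing header matches nothing
```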
