Preventing Bots from Being Counted in a Django Web App
When implementing a hit or view counter in a Django web project, we might want to exclude bots and web crawlers. Let's try to do that manually. Consider a model representing a blog article, in models.py:
from django.db import models
from django.utils import timezone


class Article(models.Model):
    title = models.CharField(max_length=1000)
    slug = models.SlugField()  # referenced by the view below
    date_created = models.DateTimeField(auto_now_add=True)
    date_modified = models.DateTimeField(default=timezone.now, editable=False)
    view_count = models.IntegerField(default=0)
    # and other fields...
The view count (view_count) should be increased by one only if:
- the signed cookie is not present,
- the visitor is not authenticated (i.e., not the creator), and
- the visitor is not a bot or crawler.
In views.py:
import datetime

from django.db.models import F
from django.shortcuts import get_object_or_404, render


def article(request, year, slug):
    """Render an article."""
    art = get_object_or_404(Article, date_created__year=year, slug=slug)
    response = render(request, 'article.html', {
        'art': art,
    })
    # update the article's view count using cookies
    cookie_not_present = request.COOKIES.get(cookie_name) is None
    user_not_authenticated = not request.user.is_authenticated
    user_agent_not_bot = not _is_user_agent_bot(request.META.get('HTTP_USER_AGENT', ''))
    if cookie_not_present and user_not_authenticated and user_agent_not_bot:
        three_days = datetime.datetime.now() + datetime.timedelta(days=3)
        response.set_signed_cookie(cookie_name, cookie_value, expires=three_days)
        # F() makes the increment atomic at the database level
        art.view_count = F('view_count') + 1
        art.save(update_fields=['view_count'])
    return response
Define both cookie_name and cookie_value yourself (e.g., as module-level constants). In this example, the cookie expires in three days.
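Why a *signed* cookie? Django appends a signature derived from your SECRET_KEY, so a visitor cannot forge or tamper with the cookie to manipulate the counter. A minimal sketch of the idea behind signing, using plain HMAC (this is an illustration, not Django's actual implementation, and the key is a made-up placeholder):

```python
import hashlib
import hmac

SECRET_KEY = 'not-so-secret'  # placeholder; Django uses settings.SECRET_KEY


def sign(value):
    """Append an HMAC so any tampering with the value is detectable."""
    mac = hmac.new(SECRET_KEY.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f'{value}:{mac}'


def unsign(signed_value):
    """Return the original value, or None if the signature does not verify."""
    value, _, mac = signed_value.rpartition(':')
    expected = hmac.new(SECRET_KEY.encode(), value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None
```

In the actual view you never do this by hand: response.set_signed_cookie() and request.get_signed_cookie() handle the signing and verification for you.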
Now the most important part: the _is_user_agent_bot() definition:
import re

from django.conf import settings


def _is_user_agent_bot(user_agent):
    if not user_agent:  # missing or empty User-Agent header
        return False
    # method one: exact match against the list
    if user_agent in settings.BOT_LIST:
        return True
    # method two: case-insensitive substring match
    # (escaped so '.' in names like 'Googlebot/2.1' is matched literally)
    for bot_name in settings.BOT_LIST:
        if re.search(re.escape(bot_name.lower()), user_agent.lower()):
            return True
    return False
Alright. Two methods. Method one only fires on an exact match, while method two matches the bot name anywhere in the header; since real User-Agent headers are long strings that merely *contain* the bot name, method two does most of the work. I also used .lower() to make the match case-insensitive, although this may not be necessary.
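To see the difference between the two methods, here is a standalone sketch of the same logic with a hard-coded, shortened list (so it runs outside Django), tried against a realistic Googlebot User-Agent string:

```python
import re

BOT_LIST = ('Googlebot', 'Slurp', 'Baiduspider')  # shortened for the example


def is_user_agent_bot(user_agent):
    if not user_agent:
        return False
    if user_agent in BOT_LIST:  # method one: exact match only
        return True
    for bot_name in BOT_LIST:   # method two: substring match
        if re.search(re.escape(bot_name.lower()), user_agent.lower()):
            return True
    return False


googlebot_ua = ('Mozilla/5.0 (compatible; Googlebot/2.1; '
                '+http://www.google.com/bot.html)')
print(is_user_agent_bot(googlebot_ua))                 # True, via method two
print(googlebot_ua in BOT_LIST)                        # False: method one misses it
print(is_user_agent_bot('Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0'))  # False
```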
The BOT_LIST setting must be added to settings.py:
BOT_LIST = ('Teoma', 'alexa', 'froogle', 'Gigabot', 'inktomi',
            'looksmart', 'URL_Spider_SQL', 'Firefly', 'NationalDirectory',
            'Ask Jeeves', 'TECNOSEEK', 'InfoSeek', 'WebFindBot', 'girafabot',
            'crawler', 'www.galaxy.com', 'Googlebot', 'Googlebot/2.1',
            'Google', 'Webmaster', 'Scooter', 'James Bond', 'Slurp',
            'msnbot', 'appie', 'FAST', 'WebBug', 'Spade', 'ZyBorg',
            'rabaz', 'Baiduspider', 'Feedfetcher-Google', 'TechnoratiSnoop',
            'Rankivabot', 'Mediapartners-Google', 'Sogou web spider',
            'WebAlta Crawler', 'MJ12bot', 'Yandex/', 'YaDirectBot',
            'StackRambler', 'DotBot')
I got the list from here. These are all bot names, e.g., Google's Googlebot and Yahoo's Slurp.
Of course, this only works if the bot actually sends one of the listed strings, because the HTTP User-Agent header can easily be spoofed.
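One last refinement: since BOT_LIST does not change at runtime, the per-request loop over the list can be replaced with a single precompiled, case-insensitive pattern. A sketch (the pattern name and helper are my own, and the list is again shortened):

```python
import re

BOT_LIST = ('Googlebot', 'Slurp', 'Sogou web spider')  # shortened for the example

# Compiled once at import time; names are escaped so characters like
# '.' and '/' in entries such as 'Googlebot/2.1' are matched literally.
BOT_PATTERN = re.compile('|'.join(re.escape(name) for name in BOT_LIST),
                         re.IGNORECASE)


def is_user_agent_bot(user_agent):
    """True if the User-Agent contains any listed bot name."""
    return bool(user_agent and BOT_PATTERN.search(user_agent))
```

This also handles a missing header naturally: a None or empty user_agent short-circuits to False before the regex runs.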