





This is my OLD blog. I've copied this post over to my NEW blog at:

http://www.saltycrane.com/blog/2008/07/django-blog-project-9-migrating-blogger/


In my last post, I talked about adding comments to my new sample blog application. That was about the last basic feature I needed before I could start using it for real. Of course, there are still a number of features I'd like to add, such as automatic syntax highlighting with Pygments, django-tagging, and some more interesting views, not to mention comment moderation. But I think those will have to wait. I want to start using my new blog for real sometime.

So for the past few days, I've been working on my Beautiful Soup screen scraper script to copy all my Blogger posts over to my new Django blog. Initial results came quickly (it's pretty cool to see such a huge data dump after only a few lines of Beautiful Soup'ing), but the details (especially with the comments) kind of slowed me down. I've finally got everything copied over to my satisfaction. Below is the script I used to do it. Note, I realize it's not pretty; it's just a one-time-use hack. But hopefully someone else doing the same thing might find it useful.

    #!/usr/bin/env python

    import datetime
    import os
    import re
    import urllib2

    from BeautifulSoup import BeautifulSoup
    from myblogapp.models import Post, LegacyComment
    from django.contrib.comments.models import FreeComment

    URL = ''.join([
        'http://iwiwdsmi.blogspot.com/search?',
        'updated-min=2006-01-01T00%3A00%3A00-08%3A00&',
        'updated-max=2009-01-01T00%3A00%3A00-08%3A00&',
        'max-results=1000'
    ])

    html = urllib2.urlopen(URL).read()
    soup = BeautifulSoup(html)

    # running tally of posts that contain highlighted code sections
    hl_count = 0
    hl_list = []

    for post in soup.html.body.findAll('div', {'class': 'post'}):
        print
        print '--------------------------------------------------------------'

        # save the post title and permalink
        h3 = post.find('h3', {'class': 'post-title'})
        post_href = h3.find('a')['href']
        post_title = h3.find('a').string
        # splitext drops only the ".html" extension; rstrip('.html') would
        # also eat trailing 'h', 't', 'm', 'l' characters from the slug
        post_slug = os.path.splitext(os.path.basename(post_href))[0]
        print post_slug
        print post_href
        print post_title

        # save the post body
        div = post.find('div', {'class': 'post-body'})
        [toremove.extract() for toremove in div.findAll('script')]
        [toremove.extract() for toremove in div.findAll('span', {'id': 'showlink'})]
        [toremove.extract() for toremove in div.findAll('div', {'style': 'clear: both;'})]
        [toremove.parent.extract() for toremove in div.findAll(text='#fullpost{display:none;}')]
        post_body = ''.join([str(item) for item in div.contents]).rstrip()
        # point links to my old Blogger posts at the new site
        post_body = re.sub(r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
                           r"www.saltycrane.com/blog/\1/",
                           post_body)

        # count number of highlighted code sections
        highlight = div.findAll('div', {'class': 'highlight'})
        if highlight:
            hl_count += len(highlight)
            hl_list.append(post_title)

        # save the timestamp
        a = post.find('a', {'class': 'timestamp-link'})
        try:
            post_timestamp = a.string
        except AttributeError:
            # no timestamp link; fall back to the year/month in the permalink
            match = re.search(r"\.com/(\d{4})/(\d{2})/", post_href)
            if match:
                year = match.group(1)
                month = match.group(2)
                post_timestamp = "%s/01/%s 11:11:11 AM" % (month, year)
        print post_timestamp

        # save the tags (this is ugly, i know)
        if 'error' in post_title.lower():
            post_tags = ['error']
        else:
            post_tags = []
        span = post.find('span', {'class': 'post-labels'})
        if span:
            a = span.findAll('a', {'rel': 'tag'})
        else:
            a = post.findAll('a', {'rel': 'tag'})
        post_tags = ' '.join([tag.string for tag in a] + post_tags)
        if not post_tags:
            post_tags = 'untagged'
        print post_tags

        # add Post object to new blog
        if True:
            p = Post()
            p.title = post_title
            p.body = post_body
            p.date_created = datetime.datetime.strptime(post_timestamp,
                                                        "%m/%d/%Y %I:%M:%S %p")
            p.date_modified = p.date_created
            p.tags = post_tags
            p.slug = post_slug
            p.save()

        # check if there are comments
        a = post.find('a', {'class': 'comment-link'})
        if a:
            comm_string = a.string.strip()
        else:
            comm_string = "0"

        if comm_string[0] != "0":
            print
            print "COMMENTS:"

            # get the page with comments
            html_single = urllib2.urlopen(post_href).read()
            soup_single = BeautifulSoup(html_single)

            # get comments
            comments = soup_single.html.body.find('div', {'class': 'comments'})
            cauth_list = comments.findAll('dt')
            cbody_list = comments.findAll('dd', {'class': 'comment-body'})
            cdate_list = comments.findAll('span', {'class': 'comment-timestamp'})
            if not len(cauth_list) == len(cbody_list) == len(cdate_list):
                raise Exception("didn't get all comment data")

            for auth, body, date in zip(cauth_list, cbody_list, cdate_list):
                # create comment in database
                lc = LegacyComment()
                lc.body = str(body.p)

                # find author
                lc.author = "Anonymous"
                auth_a = auth.findAll('a')[-1]
                auth_no_a = auth.contents[2]
                if auth_a.string:
                    lc.author = auth_a.string
                elif auth_no_a:
                    match = re.search(r"\s*([\w\s]*\w)\s+said", str(auth_no_a))
                    if match:
                        lc.author = match.group(1)
                print lc.author

                # find website
                try:
                    lc.website = auth_a['href']
                except KeyError:
                    lc.website = ''
                print lc.website

                # other info
                lc.date_created = datetime.datetime.strptime(
                    date.a.string.strip(), "%m/%d/%Y %I:%M %p")
                print lc.date_created
                lc.date_modified = lc.date_created
                lc.post_id = p.id
                lc.save()
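The three pure string steps in the script (deriving the slug, rewriting links to old posts, and parsing the timestamp) can be tried out on their own without touching the network or the database. Here is a minimal sketch of them as standalone functions. The function names and the example permalink are made up for illustration, and the code is written to run on a current Python rather than the Python 2 the script itself uses. One thing worth knowing: `str.rstrip('.html')` strips any trailing run of the characters `.`, `h`, `t`, `m`, `l` rather than the literal suffix, so `os.path.splitext` is the safer way to drop the extension.

```python
import datetime
import os
import re


def slug_from_permalink(href):
    """Return the last path component of a permalink without its extension."""
    # splitext removes only the final ".html" extension.
    return os.path.splitext(os.path.basename(href))[0]


def rewrite_internal_links(body):
    """Point old Blogger permalinks at the matching path on the new site."""
    return re.sub(
        r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
        r"www.saltycrane.com/blog/\1/",
        body,
    )


def parse_post_timestamp(text):
    """Parse a timestamp in the month/day/year, 12-hour format the script expects."""
    return datetime.datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


if __name__ == "__main__":
    # Hypothetical permalink, used only to exercise the helpers.
    href = "http://iwiwdsmi.blogspot.com/2008/07/hypothetical-post-title.html"
    print(slug_from_permalink(href))
    print(rewrite_internal_links('<a href="%s">link</a>' % href))
    print(parse_post_timestamp("07/21/2008 11:30:00 PM"))
```

Splitting these out also makes it easy to check edge cases, such as a slug that itself ends in "html", before running the full import.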

I also made some changes to my Django blog code as I migrated my Blogger posts. The main addition was a LegacyComment model along with the associated views and templates. My Blogger comments consisted of HTML markup, but I didn't want to allow arbitrary HTML in my new comments for fear of cross-site scripting. So I kept my legacy Blogger comments separate from my new Django site comments.

Here are my model changes. I added a LegacyComment class which contains the pertinent comment attributes and a ForeignKey to the post that it belongs to. I also added an lc_count (for legacy comment count) field to the Post class which stores the number of legacy comments for the post. It is incremented by the save() method in the LegacyComment class every time a comment is saved. Hmmm, I just realized the count will be wrong if I ever edit these comments, since every save, not just the first, bumps the count. Well, since these are legacy comments, hopefully I won't have to edit them.

~/src/django/myblogsite/myblogapp/models.py

    import re
    from django.db import models

    class Post(models.Model):
        title = models.CharField(maxlength=200)
        slug = models.SlugField(maxlength=100)
        date_created = models.DateTimeField() #auto_now_add=True)
        date_modified = models.DateTimeField()
        tags = models.CharField(maxlength=200)
        body = models.TextField()
        body_html = models.TextField(editable=False, blank=True)
        lc_count = models.IntegerField(default=0, editable=False)

        def get_tag_list(self):
            return re.split(" ", self.tags)

        def get_absolute_url(self):
            return "/blog/%d/%02d/%s/" % (self.date_created.year,
                                          self.date_created.month,
                                          self.slug)

        def __str__(self):
            return self.title

        class Meta:
            ordering = ["-date_created"]

        class Admin:
            pass

    class LegacyComment(models.Model):
        author = models.CharField(maxlength=60)
        website = models.URLField(core=False)
        date_created = models.DateTimeField()
        date_modified = models.DateTimeField()
        body = models.TextField()
        post = models.ForeignKey(Post)

        def save(self):
            p = Post.objects.get(id=self.post.id)
            p.lc_count += 1
            p.save()
            super(LegacyComment, self).save()

        class Meta:
            ordering = ["date_created"]

        class Admin:
            pass
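One possible way to keep lc_count from drifting when a legacy comment is edited would be to increment it only on the first save. The sketch below shows the idea with plain Python stand-ins rather than real Django models: FakePost and FakeLegacyComment are hypothetical names, and the code assumes that an unsaved object has an id of None, which is how Django model instances behave before their first save. It is not what the blog code above does.

```python
class FakePost(object):
    """Hypothetical stand-in for the Post model; only tracks lc_count."""

    def __init__(self):
        self.lc_count = 0


class FakeLegacyComment(object):
    """Hypothetical stand-in for LegacyComment with an in-memory save()."""

    _next_id = 1  # simulates the database handing out primary keys

    def __init__(self, post, body):
        self.id = None  # assumption: unsaved objects have no primary key yet
        self.post = post
        self.body = body

    def save(self):
        if self.id is None:
            # First save: assign a key and count the comment exactly once.
            self.id = FakeLegacyComment._next_id
            FakeLegacyComment._next_id += 1
            self.post.lc_count += 1
        # Later saves (edits) fall through without touching the count.


if __name__ == "__main__":
    post = FakePost()
    comment = FakeLegacyComment(post, "first draft")
    comment.save()
    comment.body = "edited"
    comment.save()
    print(post.lc_count)
```

In a real model the same check would go at the top of save(), before calling the parent class's save().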

Here is an excerpt from my views.py file showing the changes:

~/src/django/myblogsite/myblogapp/views.py

    import re
    from datetime import datetime
    from django.shortcuts import render_to_response
    from myblogsite.myblogapp.models import Post, LegacyComment

    MONTH_NAMES = ('', 'January', 'February', 'March', 'April', 'May', 'June', 'July',
                   'August', 'September', 'October', 'November', 'December')
    MAIN_TITLE = "Sofeng's Blog 0.0.7"

    # init() is defined elsewhere in views.py and is not part of this excerpt

    def frontpage(request):
        posts, pagedata = init()
        posts = posts[:5]
        pagedata.update({'post_list': posts,
                         'subtitle': '',})
        return render_to_response('listpage.html', pagedata)

    def singlepost(request, year, month, slug2):
        posts, pagedata = init()
        post = posts.get(date_created__year=year,
                         date_created__month=int(month),
                         slug=slug2,)
        legacy_comments = LegacyComment.objects.filter(post=post.id)
        pagedata.update({'post': post,
                         'lc_list': legacy_comments,})
        return render_to_response('singlepost.html', pagedata)
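The singlepost view receives year, month, and slug as strings captured from the URL, and it converts the month with int() so that a zero-padded "07" matches month 7. The sketch below shows that round trip between the "/blog/%d/%02d/%s/" format used by get_absolute_url and a URL pattern that could feed the view. The regular expression is an assumption, since urls.py isn't shown in this post, and the helper names are made up for illustration.

```python
import re

# Hypothetical pattern; the real urls.py may differ.
POST_URL = re.compile(
    r"^/blog/(?P<year>\d{4})/(?P<month>\d{2})/(?P<slug>[\w-]+)/$"
)


def absolute_url(year, month, slug):
    """Build a post URL the same way Post.get_absolute_url formats it."""
    return "/blog/%d/%02d/%s/" % (year, month, slug)


def parse_post_url(path):
    """Return (year, month, slug) for a post URL, or None if it doesn't match."""
    match = POST_URL.match(path)
    if match is None:
        return None
    # int() turns the zero-padded "07" into 7, as the view does for the month.
    return (int(match.group("year")),
            int(match.group("month")),
            match.group("slug"))


if __name__ == "__main__":
    url = absolute_url(2008, 7, "django-blog-project-9-migrating-blogger")
    print(url)
    print(parse_post_url(url))
```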

In the list page template, I used the truncatewords_html template filter to show a 50-word post summary on the list pages instead of the full post. I also added the legacy comment count to the Django free comment count to display the total number of comments.

~/src/django/myblogsite/templates/listpage.html

    {% block main %}
      <br>
      {% for post in post_list %}
        <h4><a href="/blog/{{ post.date_created|date:"Y/m" }}/{{ post.slug }}/">
            {{ post.title }}</a>
        </h4>
        {{ post.body|truncatewords_html:"50" }}
        <a href="{{ post.get_absolute_url }}">Read more...</a><br>
        <br>
        <hr>
        <div class="post_footer">
          {% ifnotequal post.date_modified.date post.date_created.date %}
            Last modified: {{ post.date_modified.date }}<br>
          {% endifnotequal %}
          Date created: {{ post.date_created.date }}<br>
          Tags:
          {% for tag in post.get_tag_list %}
            <a href="/blog/tag/{{ tag }}/">{{ tag }}</a>{% if not forloop.last %}, {% endif %}
          {% endfor %}
          <br>
          {% get_free_comment_count for myblogapp.post post.id as comment_count %}
          <a href="{{ post.get_absolute_url }}#comments">
            {{ comment_count|add:post.lc_count }}
            Comment{{ comment_count|add:post.lc_count|pluralize }}</a>
        </div>
        <br>
      {% endfor %}
    {% endblock %}


In the single post template, I added the display of the legacy comments in addition to the Django free comments.

~/src/django/myblogsite/templates/singlepost.html

    <a name="comments"></a>
    {% if lc_list %}
      <h4>{{ lc_list|length }} Legacy Comment{{ lc_list|length|pluralize }}</h4>
    {% endif %}
    {% for legacy_comment in lc_list %}
      <br>
      <a name="lc{{ legacy_comment.id }}" href="#lc{{ legacy_comment.id }}">
        #{{ forloop.counter }}</a>
      {% if legacy_comment.website %}
        <a href="{{ legacy_comment.website }}">
          <b>{{ legacy_comment.author|escape }}</b></a>
      {% else %}
        <b>{{ legacy_comment.author|escape }}</b>
      {% endif %}
      commented, on {{ legacy_comment.date_created|date:"F j, Y" }} at
      {{ legacy_comment.date_created|date:"P" }}:
      {{ legacy_comment.body }}
    {% endfor %}
    <br>


That's it. Hopefully, I can start using my new blog soon. Please browse around on the new Django site and let me know if you run across any problems. When everything looks to be OK, I'll start posting only on my new Django site.

Here is a screenshot of version 0.0.8:

The live site can be viewed at: http://saltycrane.com/blog
