In this post I’d like to create a demo realtime Stack Overflow mirror with Celery, CherryPy and MongoDB. By realtime I mean that the app will fetch results from a remote resource at short intervals, and it will display them in a simple one-page JS-HTML app without the user clicking the browser’s refresh button.
** All of the code for this tutorial is available in my blog’s GitHub account.
The design for the whole project is quite simple. First we’ll create a basic HTTP client that will connect to the Stack Overflow XML feed and parse the results. The client itself will be synchronous, created with python-requests, but it will be executed as a periodic task running under the Celery beat scheduler. It will run at regular intervals, check whether there are new questions on SO, and if there are, insert them into the database.
To this I’ll add a simple RESTful backend that will return results as JSON. We’ll have one endpoint, /update. It will accept one parameter, ‘timestamp’, and will return all results fetched from Stack Overflow after the time designated by that timestamp. I’m going to use CherryPy because it’s simple and easy. CherryPy has a really gentle learning curve: if you know some Python you can get up and running in a matter of minutes, the design of the framework seems intuitive, and it doesn’t enforce any design paradigm, giving you the freedom to do what you’d like to do.
Finally I’ll add some frontend to the whole mixture: a trivial JS script polling our /update endpoint and appending (or actually prepending) results to the DOM. I’m going to use polling instead of websockets, because it’s a bit easier to start with polling; you remain at the level of a simple HTTP GET without having to set up a websocket server.
Simple Stack Overflow Scraper
First let’s write a client that will parse the Stack Overflow feed and get all new questions for us. The recent questions feed is located at http://stackoverflow.com/feeds; it’s plain XML that we can easily parse using xpaths. If you prefer BeautifulSoup or some other library, nothing should stop you from using it! I prefer xpaths only because I use them quite often, so I’m familiar with them.
The script simply visits the feed and extracts the title, link and author of each post, then stores this data in MongoDB. We use a hash of the link as the object id to ensure that duplicate records are not inserted into the collection. When you try to insert a duplicate id, MongoDB will raise an exception. If this happens we know that we’ve encountered a post we already have in the database, and we can safely stop parsing the remaining questions.
You can call the ‘questions’ function, run it normally, and perhaps print some results to see if it works OK.
Scheduling our client at regular intervals
Now we would like to be able to run our script at regular intervals. As usual there are many ways to do this: you could set it up as a cron job, or you could use Python’s time.sleep(). I’m going to use Celery. Celery is an asynchronous task runner; it allows you to turn your function into a task that will be executed in the background. It will nicely handle the operational problems for you: it can retry a task, report problems, log what happens, and so on. Running your process in the background, with something that manages it properly, is a huge benefit: your server app can just forget about the task and do its thing as it normally does, without minding the task running in the background.
Turning our Stack scraper into a Celery task is easy: we just need to create a Celery app instance and decorate our function with Celery’s ‘task’ decorator.
Now we need to add a Celery beat schedule that will ensure our task runs every 30 seconds. We’ll use the following Celery config:
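Something along these lines (the ‘tasks.fetch_questions’ task path is an assumption about where the task is defined):

```python
# celeryconfig.py sketch: run the scraper task every 30 seconds.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'fetch-stack-questions': {
        'task': 'tasks.fetch_questions',  # assumed module.task path
        'schedule': timedelta(seconds=30),
    },
}
```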
You need to start it with:
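Assuming the Celery app lives in a module called tasks, a command like this should do it (-B embeds the beat scheduler into the worker process):

```shell
celery -A tasks worker -B --loglevel=info
```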
You should see in the logs that Celery is up and running, scheduling the task at regular intervals.
If you open the mongo shell and check your ‘questions’ collection in the ‘stack_questions’ database, you’ll see new posts inserted.
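For example, as a one-liner (assuming mongod is running locally):

```shell
# Print a few stored questions from the 'questions' collection.
mongo stack_questions --eval 'db.questions.find().limit(3).forEach(printjson)'
```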
Creating the web app
We now have a script that pings Stack Overflow and checks if there are new questions in the XML feed. It’s time to actually display the results in a browser.
First we need a server that will serve some static assets (our index.html and js) and return posts from the database. This can be written with CherryPy in a matter of minutes; what is cool about CherryPy is that it looks like plain old Python, it doesn’t read like a framework at all.
You can start our app just like you’d run any other Python script; this is all you need to start it:
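Assuming you saved the CherryPy app as server.py (a filename picked for illustration):

```shell
python server.py
```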
Our client-side code will send an ajax GET request to the /update endpoint with a timestamp as the sole parameter. When the page first loads the timestamp will be set to zero, and the script will fetch all results from the database. After fetching results it will append them to the DOM and add a ‘modified’ attribute to the container div. On subsequent calls it will take the value of the ‘modified’ attribute and use it to query the server. So our JS essentially says something like: “hey, server, give me all results fetched after I last updated the DOM”. If the server doesn’t have anything new it will respond with a blank answer and the script will do nothing; if there are new questions fetched by our Celery stack scraper, the script will append them to the DOM and refresh the ‘modified’ attribute.
The polling part will look like this:
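Here’s a sketch of the whole client-side loop (assuming jQuery is loaded, the page has a div with id "questions", and /update returns a JSON list of objects with title, link, author and timestamp fields):

```javascript
var POLL_INTERVAL = 5000; // ms between polls; an arbitrary choice

// Build an HTML snippet for one question. Plain string concatenation,
// which is crude, but it keeps the example short.
function questionHtml(q) {
  return '<div class="question">' +
         '<a href="' + q.link + '">' + q.title + '</a>' +
         ' by ' + q.author +
         '</div>';
}

function poll() {
  var container = $('#questions');
  // The first call sends timestamp=0, so the server returns everything.
  var since = container.attr('modified') || '0';
  $.getJSON('/update', {timestamp: since})
    .done(function (questions) {
      $.each(questions, function (i, q) {
        container.prepend(questionHtml(q));
        // Remember the newest timestamp rendered so far.
        container.attr('modified', q.timestamp);
      });
    })
    .always(function () {
      // .always() fires on success and failure alike, so the loop
      // keeps polling even if a single request errors out.
      setTimeout(poll, POLL_INTERVAL);
    });
}
```

Kick it off once on page load, e.g. with $(document).ready(poll).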
We’ll use jQuery’s .always() so that the code sets a new timeout even in case of failures.
The part that appends to the DOM is rather typical. You could use some JS template library, like Mustache, to make the code cleaner and more readable; generating DOM from strings is probably bad practice, but we’ll do it here for the sake of simplicity.
At this point it’s ready: start your Celery scraper, launch the Python site, and you’ll see SO questions displayed.