Nanotwitter

Nano Version of Twitter

View project on GitHub

Shraddha Basnyat, Viet Le, Subahu Rayamajhi, Dewar Tan

Introduction

This is a nano version of Twitter that we built for COSI 105B: Software Development for Scalability, instructed by Pito Salas at Brandeis University.

The main purpose of this project is to apply the scaling techniques we learned in class to a live project under controlled conditions.

Overview

Following the project guidelines, our NanoTwitter implements these functionalities:

  • Users:

    • can register for an account by supplying an email, a username, a full name and a password
    • can search for tweets
  • Logged-in users:

    • can follow and unfollow other registered users
    • can tweet
    • can see the flow of the last n tweets by the users they follow
    • can favorite and retweet any tweet
  • Non-logged-in users:

    • can see the flow of the last n tweets by any user
  • Tweets:

    • consist of
      • up to 140 characters of text (HTML-escaped)
      • a date-time of creation
    • belong to one user
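As a rough illustration, the length constraint and HTML escaping on tweet text might look like this in plain Ruby (a sketch; `sanitize_tweet` is a hypothetical helper, not the app's actual code):

```ruby
require "cgi"

# Hypothetical sketch of the checks applied when a tweet is created:
# the raw text must fit in 140 characters and is HTML-escaped before
# being rendered, so user input can't inject markup.
def sanitize_tweet(text)
  raise ArgumentError, "tweet too long" if text.length > 140
  CGI.escapeHTML(text)
end
```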

We also provide a REST API:

  • All start with /api/v1
  • Tweets:

    • /tweets/id
    • /tweets/recent
  • Users:

    • /users/id/tweets
    • /users/id/followers
    • /users/id/following
    • /users/id/retweets
    • /users/id/favorites
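For illustration, the JSON payload for an endpoint like /tweets/recent might be built as follows (a sketch using Ruby's standard json library; the field names are assumptions, not the app's actual schema):

```ruby
require "json"

# Hypothetical serializer for the /api/v1/tweets/recent response.
# Each tweet is reduced to the fields a client would need.
def recent_tweets_json(tweets)
  tweets.map { |t| { id: t[:id], text: t[:text], created_at: t[:created_at] } }.to_json
end
```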

The screenshots below show the latest version of the application:

  1. Homepage
  2. Signup page
  3. User page
  4. Search results

Load testing

We used loader.io to benchmark our app, connecting 500 users over the course of 1 minute.

We did 3 main load tests:

  1. GET / - load up homepage

  2. GET /user/testuser - load up testuser homepage

  3. POST /user/testuser/tweet - have testuser create 1 tweet

Our final results are as follows:

  1. Load Homepage, 2000 visits
    • Conditions: users=100 tweets=500 followers=30
    • Conditions: users=500 tweets=500 followers=100
    • Conditions: users=3000 tweets=2000 followers=1000
  2. Load Test User, 1800 visits
    • Conditions: users=100 tweets=500 followers=30
    • Conditions: users=500 tweets=500 followers=100
    • Conditions: users=3000 tweets=2000 followers=1000
  3. Testuser creates 3000 tweets
    • Conditions: users=100 tweets=500 followers=30
    • Conditions: users=500 tweets=500 followers=100
    • Conditions: users=3000 tweets=2000 followers=1000

Technology

Framework

The app was built on top of Sinatra, a Ruby web framework, in the modular style. A major advantage of writing the app in the modular style is that each controller exists as an isolated Sinatra application.

We then use Rack middleware to mount them so that each handles its corresponding URL requests.
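A minimal config.ru along these lines might look like this (a sketch; the controller class names are assumptions, not the app's actual constants):

```ruby
# config.ru — mounting the modular Sinatra controllers with Rack's `map`,
# so each isolated application handles its own URL prefix.
require_relative "app"

map("/api/v1") { run NanoTwitter::ApiController }
map("/user")   { run NanoTwitter::UserController }
map("/")       { run NanoTwitter::MainController }
```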

Web Server

We migrated from Ruby's default web server, WEBrick, to Puma to take advantage of multiple threads. In our testing, however, we couldn't find any noticeable difference between the two.

Database

We used PostgreSQL for storing persistent data.

Caching

We used Redis as our caching backend. On Heroku, we had 3 Redis stores: one for database query caching, one for fragment caching, and the last one for Sidekiq.

The major improvement in response time came from fragment caching. We put the HTML for a Tweet object in a Redis store, so each time that object was requested, we served the HTML from Redis. This removed the overhead of connecting to the database to fetch the Tweet object.
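The fragment-caching pattern can be sketched like this (the helper name is an assumption; `store` stands in for the Redis client, which responds to get/set — an in-memory stand-in is included for illustration):

```ruby
# Sketch of a fragment-caching helper: return cached HTML if present,
# otherwise render it (the expensive, database-hitting step), cache it,
# and return it.
def cached_fragment(store, key)
  html = store.get(key)
  return html if html
  html = yield            # render the fragment (hits the database)
  store.set(key, html)
  html
end

# A minimal in-memory stand-in for the Redis client:
class MemoryStore
  def initialize; @h = {}; end
  def get(k); @h[k]; end
  def set(k, v); @h[k] = v; end
end
```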

In one of our load tests, connecting 100 clients to the homepage over the course of 1 minute, the average response time before fragment caching was 146ms. After fragment caching was introduced, the average dropped to 98ms.

We also experimented with page caching, storing the entire page in Redis. This was implemented using Rack Cache. Our homepage had 2 additional HTTP headers, ETag and Last-Modified, which Rack Cache used to determine whether the page had been updated. If there was no change, the old HTML page was served directly from Redis. Otherwise, a new page was rendered and stored in Redis for future use.
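The validator can be sketched as follows (a hypothetical helper, assuming the ETag is derived from the newest tweet; as long as nothing new is posted, the value is stable and Rack Cache can serve the stored page):

```ruby
require "digest"

# Sketch of computing a homepage ETag from the newest tweet's id and
# timestamp: identical inputs yield an identical validator, so an
# unchanged homepage produces a cache hit.
def homepage_etag(latest_tweet_id, latest_tweet_time)
  Digest::MD5.hexdigest("#{latest_tweet_id}-#{latest_tweet_time}")
end
```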

Our homepage benefited a lot from full page caching. However, one bug we encountered was that a logged-in user could be served the cached version of the homepage meant for non-logged-in users (without the ability to post a new tweet or log out). We weren't able to fix this bug, so full page caching wasn't implemented on other pages.

Background jobs

We used a lot of callbacks in our application to fill the Timeline table. Every time a user followed another, the follower's timeline was updated with the 50 most recent tweets by the followed user. Every time a user tweeted, all of their followers got the new tweet on their timelines. The major drawback of this implementation was the large number of database writes. As in the Scaling Twitter slides, our application had the cost of O(1) reads and O(n) writes.

Background jobs were implemented to reduce the cost of writing. The replication of tweets to all followers was done in the background without affecting end users. We used the Sidekiq gem to achieve this.
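The fan-out itself can be sketched in plain Ruby (names are assumptions; in the app, logic like this would run inside a Sidekiq worker's perform method rather than inline):

```ruby
# Sketch of the background fan-out: push a new tweet onto each
# follower's timeline, keeping each timeline capped at 50 entries.
def fan_out(tweet, follower_timelines, cap = 50)
  follower_timelines.each do |timeline|
    timeline.unshift(tweet)          # newest tweet goes first
    timeline.pop while timeline.length > cap
  end
end
```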

We assumed that only a small percentage of users were active content creators and the rest were content consumers, so the cost of writing was acceptable to us. However, due to this implementation, it took a long time to set up test cases for load testing. We found that it was faster to drop and create a new database than to call /test/reset/all, which tried to destroy every user, tweet, follow and timeline.

Miscellaneous

CDN

Instead of serving static assets from Heroku, we took advantage of public content delivery networks (CDNs) to serve jQuery and Bootstrap.

Compression

We used Rack::Deflater, a Rack middleware that intercepts responses and compresses them before sending them back to users. Smaller files resulted in faster response times for end users.
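Enabling it is a one-line middleware addition in config.ru (a sketch; the app constant is an assumption):

```ruby
# config.ru — response compression (a minimal sketch)
require "rack"

use Rack::Deflater       # gzip responses when the client accepts it
run NanoTwitter::App     # hypothetical application constant
```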

What could be done better

We used PostgreSQL on Heroku and SQLite locally because the latter was easy to set up on our local machines. SQLite lacks a lot of the bells and whistles that PostgreSQL has. Because the production and development environments were different, we didn't invest enough time in improving PostgreSQL.

One way to improve database performance would be to replace ActiveRecord callbacks with PostgreSQL triggers. In our app, we used callbacks to replicate tweets from one user to all their followers; each callback connected to the database and ran queries. Implementing this in PostgreSQL could be faster, as it eliminates the round trips to the database, and the queries running internally would be faster.
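A hypothetical trigger along these lines might look like the following (held in a Ruby string as it would be in a migration; the table and column names are assumptions about our schema, not the app's actual DDL):

```ruby
# Sketch of a PostgreSQL trigger replacing the ActiveRecord callback:
# on each insert into tweets, copy the row into every follower's timeline
# entirely inside the database, with no application round trips.
FANOUT_TRIGGER_SQL = <<~SQL
  CREATE OR REPLACE FUNCTION fan_out_tweet() RETURNS trigger AS $$
  BEGIN
    INSERT INTO timelines (user_id, tweet_id, created_at)
    SELECT f.follower_id, NEW.id, NEW.created_at
    FROM follows f
    WHERE f.followed_id = NEW.user_id;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER tweets_fan_out
  AFTER INSERT ON tweets
  FOR EACH ROW EXECUTE PROCEDURE fan_out_tweet();
SQL
```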

Sidekiq jobs run asynchronously and non-sequentially. There are libraries that make background jobs run sequentially. Using one of these, we could have a single URL for creating a test case (reset testuser, seed tweets, seed users, seed followings...), instead of creating one job and checking the status page to make sure it is done before running another.