How to calculate real-time stats?

Posted by Diego Jancic on Stack Overflow
Published on 2012-06-01T20:30:18Z Indexed on 2012/06/02 4:40 UTC

I have a site with millions of users (well, actually it doesn't have any yet, but let's imagine), and I want to calculate some stats like "log-ins in the past hour".

The problem is similar to the one described here: http://highscalability.com/blog/2008/4/19/how-to-build-a-real-time-analytics-system.html

The simplest approach would be to do a select like this:

select count(distinct user_id)
from logs
where date >= '20120601 1200' and date < '20120601 1300'

(Other conditions could apply to the stats as well, like log-ins per country.) Of course, this would be really slow, especially if the table has millions (or even thousands) of rows, and I want to run this query every time a page is displayed.

How would you summarize the data? What should go to the (mem)cache?

EDIT: I'm looking for a way to de-normalize the data, or to keep the cache up-to-date. For example, I could increment an in-memory variable every time someone logs in, but that would only tell me the total number of log-ins, not the "log-ins in the last hour". I hope it's clearer now.
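To make the in-memory counter idea more concrete, here is a rough sketch of one possible variation: instead of a single counter, keep one counter per minute and sum the last 60 of them. The class name `SlidingLoginCounter`, the one-minute bucket size and the method names are all assumptions made for illustration, not part of any existing library. Note that it counts log-ins rather than distinct users, so it is not an exact replacement for the query above, and the window is only accurate to within one bucket.

```python
import time
from collections import defaultdict

BUCKET_SECONDS = 60   # assumed bucket size: one counter per minute
WINDOW_BUCKETS = 60   # 60 one-minute buckets cover roughly the past hour


class SlidingLoginCounter:
    """Hypothetical helper: keeps one log-in counter per minute bucket."""

    def __init__(self):
        self.buckets = defaultdict(int)  # bucket index -> log-in count

    def record_login(self, now=None):
        # Called on every log-in; increments the counter of the current minute.
        now = time.time() if now is None else now
        self.buckets[int(now // BUCKET_SECONDS)] += 1

    def logins_last_hour(self, now=None):
        now = time.time() if now is None else now
        current = int(now // BUCKET_SECONDS)
        oldest = current - WINDOW_BUCKETS + 1
        # Drop buckets that fell out of the window so memory stays bounded.
        for key in [k for k in self.buckets if k < oldest]:
            del self.buckets[key]
        # Sum the counters of the buckets that are still inside the window.
        return sum(self.buckets.get(b, 0) for b in range(oldest, current + 1))


counter = SlidingLoginCounter()
counter.record_login()
print(counter.logins_last_hour())
```

Presumably the same layout could be moved to memcached by using one key per minute bucket with an expiry a bit longer than an hour, but I have not tried that. Counting distinct users would need a set of user ids per bucket (or an approximate structure) instead of a plain integer.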
