Speeding up inner-joins and subqueries while restricting row size and table membership


I'm developing an RSS feed reader that uses a Bayesian filter to filter out boring blog posts.

The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and Bayesian filter classifications.

After a user marks an entry as read, it will be added to the metadata table (so that a user isn't presented with material they have already read) and deleted from the stream table. Every three minutes, a background process repopulates the Stream table with new entries (i.e. whenever the daemon adds new entries after it checks the RSS feeds for updates).
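For context, the mark-as-read step is roughly the following (a minimal sketch using the table and column names above; the literal ids and the transaction wrapper are just illustrative):

-- mark entry 42 as read for user 1 (illustrative ids)
begin;
insert into metadata (entry_id, user_id) values (42, 1);
delete from stream where entry_id = 42 and user_id = 1;
commit;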

Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; capping it there would reduce duplication, make processing faster and give me some flexibility in how I display the entries.

The query (takes about 9 seconds on 3600 items with no indexes):

insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
  and entries.id not in (select entry_id from metadata where metadata.user_id = 1)
  and entries.id not in (select entry_id from stream where user_id = 1);

The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. that do not exist in metadata) and that do not already exist in the stream.
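For reference, an equivalent anti-join formulation would look something like this (assuming entry_id is NOT NULL in metadata and stream; I haven't benchmarked it, so treat it as a sketch rather than a fix):

insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
-- the two left joins replace the NOT IN subqueries: keep only rows with no match
left join metadata
    on metadata.entry_id = entries.id and metadata.user_id = 1
left join stream
    on stream.entry_id = entries.id and stream.user_id = 1
where subscriptions_users.user_id = 1
  and metadata.entry_id is null
  and stream.entry_id is null;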

Attempted solution: adding LIMIT 100 to the end speeds up the query considerably, but repeated executions keep adding a different set of 100 entries that do not already exist in the table (with each successive query taking longer and longer).
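Concretely, the limited version was something like the following (the ORDER BY is something I added so repeated runs pick entries in a stable order; it's a guess, not something I've verified helps):

insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
  and entries.id not in (select entry_id from metadata where metadata.user_id = 1)
  and entries.id not in (select entry_id from stream where user_id = 1)
order by entries.id   -- added so repeated runs walk the entries in a stable order
limit 100;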

This is close but not quite what I wanted to do.

Does anyone have any advice (nosql?) or know a more efficient way of composing the query?
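For completeness, these are the indexes I'd guess would help the join and the membership checks (guesses based on the columns being filtered on; nothing is indexed yet):

-- join from subscriptions_users to entries, filtered by user
create index idx_subs_users_user_sub on subscriptions_users (user_id, subscription_id);
create index idx_entries_subscription on entries (subscription_id);
-- membership checks against metadata and stream
create index idx_metadata_user_entry on metadata (user_id, entry_id);
create index idx_stream_user_entry on stream (user_id, entry_id);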
