How to prune a data set?

Posted by sakura90 on Stack Overflow
Published on 2010-06-06T10:26:01Z

The MovieLens data set provides a table with columns:

userid | movieid | tag | timestamp

I have trouble reproducing the way they pruned the MovieLens data set used in:

http://www.cse.ust.hk/~yzhen/papers/tagicofi-recsys09-zhen.pdf

Section 4.1 (Data Set) of the above paper states: "For the tagging information, we only keep those tags which are added on at least 3 distinct movies. As for the users, we only keep those users who used at least 3 distinct tags in their tagging history. For movies, we only keep those movies that are annotated by at least 3 distinct tags."

I tried to query the database:

SELECT TMP.userid, COUNT(*) AS tagnum
FROM (SELECT DISTINCT T.userid AS userid, T.tag AS tag FROM tags T) AS TMP
GROUP BY TMP.userid
HAVING COUNT(*) >= 3;

This returns 1760 users who each used at least 3 distinct tags. However, some of the tags those users applied appear on fewer than 3 distinct movies, so the first condition from the paper is not satisfied.

Any help is appreciated.
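
One way to read the paper's description is that the three filters depend on each other: dropping a rare tag can push a user or a movie below the threshold, so a single query per condition is not enough. The sketch below assumes the filters are meant to be applied repeatedly until nothing more is removed. The paper does not say this explicitly, so treat it as one possible interpretation, not the authors' procedure. It uses Python's sqlite3 module with a toy in-memory `tags` table; the `prune` function, the `min_count` parameter, and the sample rows are all made up for illustration.

```python
import sqlite3

def prune(conn, min_count=3):
    """Delete rows that violate any threshold, repeating until a full pass removes nothing."""
    # (entity column, column whose distinct values are counted) -- assumed reading of the paper
    rules = [
        ("tag", "movieid"),   # keep tags added on at least min_count distinct movies
        ("userid", "tag"),    # keep users who used at least min_count distinct tags
        ("movieid", "tag"),   # keep movies annotated by at least min_count distinct tags
    ]
    while True:
        before = conn.total_changes
        for key, counted in rules:
            conn.execute(
                f"DELETE FROM tags WHERE {key} IN ("
                f"  SELECT {key} FROM tags GROUP BY {key}"
                f"  HAVING COUNT(DISTINCT {counted}) < ?)",
                (min_count,),
            )
        if conn.total_changes == before:  # nothing deleted in this pass
            return

# Toy data (hypothetical): users 1-3 put tags a, b, c on movies 1-3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (userid INTEGER, movieid INTEGER, tag TEXT, timestamp INTEGER)")
rows = [(u, m, t, 0) for u in (1, 2, 3) for m in (1, 2, 3) for t in ("a", "b", "c")]
# User 4 has 3 distinct tags, but tag 'x' appears on only one movie.
rows += [(4, 1, "a", 0), (4, 1, "b", 0), (4, 1, "x", 0)]
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", rows)

prune(conn)

remaining_users = [r[0] for r in conn.execute("SELECT DISTINCT userid FROM tags ORDER BY userid")]
remaining_rows = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
print(remaining_users, remaining_rows)
```

In the toy data, user 4 would pass the per-user query from the question, but is removed here once tag 'x' is dropped for appearing on too few movies. Note that MySQL does not allow a DELETE whose subquery selects from the same table, so there the subquery result would need to go through a derived or temporary table first.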

© Stack Overflow or respective owner