How to prune a data set?
Posted
by sakura90
on Stack Overflow
Published on 2010-06-06T10:26:01Z
dataset
The MovieLens data set provides a table with columns:
userid | movieid | tag | timestamp
I am having trouble reproducing how the MovieLens data set was pruned in this paper:
http://www.cse.ust.hk/~yzhen/papers/tagicofi-recsys09-zhen.pdf
Section 4.1 (Data Set) of the paper says: "For the tagging information, we only keep those tags which are added on at least 3 distinct movies. As for the users, we only keep those users who used at least 3 distinct tags in their tagging history. For movies, we only keep those movies that are annotated by at least 3 distinct tags."
I tried to query the database:
select TMP.userid, count(*) as tagnum
from (select distinct T.userid as userid, T.tag as tag from tags T) as TMP
group by TMP.userid
having tagnum >= 3;
This returns 1760 users who each used at least 3 distinct tags. However, some of the tags those users applied were not added to at least 3 distinct movies, so the query satisfies only the user condition and not all three conditions together.
Any help is appreciated.
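One possible reading of the quoted rules is that the three filters depend on each other: removing a rare tag can push a user or a movie below the threshold, so the filters may need to be repeated until nothing changes. The paper does not say whether the authors iterated or applied each filter once, so the sketch below is an assumption and may not reproduce their exact numbers. It uses Python's sqlite3 module and assumes a table named tags with the columns listed above; the function name prune and the min_count parameter are made up for illustration.

```python
import sqlite3

MIN_COUNT = 3  # threshold quoted from Section 4.1 of the paper


def prune(conn, min_count=MIN_COUNT):
    """Repeat the three filters on the tags table until no row is deleted.

    Assumption: the filters are applied to a fixed point. The paper does
    not state whether this is what the authors did.
    """
    # (grouping column, column whose distinct values are counted)
    rules = [
        ("tag", "movieid"),   # keep tags added on >= min_count distinct movies
        ("userid", "tag"),    # keep users who used >= min_count distinct tags
        ("movieid", "tag"),   # keep movies with >= min_count distinct tags
    ]
    while True:
        deleted = 0
        for key, counted in rules:
            cur = conn.execute(
                f"DELETE FROM tags WHERE {key} IN ("
                f"SELECT {key} FROM tags GROUP BY {key} "
                f"HAVING COUNT(DISTINCT {counted}) < ?)",
                (min_count,),
            )
            deleted += cur.rowcount
        if deleted == 0:
            return
```

Running prune on a copy of the table and then counting the distinct userid, movieid and tag values that remain gives figures to compare against the ones reported in the paper. If they differ, a single pass of each filter (or a different order) is another interpretation worth trying.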
© Stack Overflow or respective owner