how to find maximum frequent item sets from large transactional data file

Posted by ANIL MANE on Stack Overflow See other posts from Stack Overflow or by ANIL MANE
Published on 2010-04-24T11:45:14Z Indexed on 2010/04/24 11:53 UTC
Read the original article Hit count: 341

Filed under:
|
|
|
|

Hi,

I have the input file contains large amount of transactions like

Transaction ID Items

T1 Bread, milk, coffee, juice

T2 Juice, milk, coffee

T3 Bread, juice

T4 Coffee, milk

T5 Bread, Milk

T6 Coffee, Bread

T7 Coffee, Bread, Juice

T8 Bread, Milk, Juice

T9 Milk, Bread, Coffee,

T10 Bread

T11 Milk

T12 Milk, Coffee, Bread, Juice

i want the occurrence of every unique item like

Item Name Count Bread 9 Milk 8 Coffee 7 Juice 6

alt text

and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows

The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “layer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.

In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”.

Any suggestions or available source code, welcome.

thanks in advance

© Stack Overflow or respective owner

Related posts about data-mining

Related posts about c#