What is the best way to store a table in C++
- by Topo
I'm programming a decision tree in C++ using a slightly modified version of the C4.5 algorithm. Each node represents an attribute or a column of your data set and it has a children per possible value of the attribute.
My problem is how to store the training data set having in mind that I have to use a subset for each node so I need a quick way to only select a subset of rows and columns.
The main goal is to do it in the most memory and time efficient possible (in that order of priority).
The best way I have thought of is to have an array of arrays (or std::vector), or something like that, and for each node have a list (array, vector, etc) or something with the column,line(probably a tuple) pairs that are valid for that node.
I now there should be a better way to do this, any suggestions?
UPDATE: What I need is something like this:
In the beginning I have this data:
Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False
But for the second node I just need this data:
Paris    4    5.0
New York 7    1.3
Paris    9    6.8
And for the third node:
Tokio    2    9.1
Tokio    0    8.4
But with a table of millions of records with up to hundreds of columns.
What I have in mind is keep all the data in a matrix, and then for each node keep the info of the current columns and rows. Something like this:
Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False
Node 2:
columns = [0,1,2]
rows = [0,1,3]
Node 3:
columns = [0,1,2]
rows = [2,4]
This way on the worst case scenario I just have to waste 
size_of(int) * (number_of_columns + number_of_rows) * node
That is a lot less than having an independent data matrix for each node.