PyML 0.7.2 - How to prevent accuracy from dropping after storing/loading a classifier?

Posted by Michael Aaron Safyan on Stack Overflow
Published on 2010-04-22T03:04:49Z

This is a follow-up to "Save PyML.classifiers.multi.OneAgainstRest(SVM()) object?". The solution to that question was close, but not quite right. The SparseDataSet is broken, so attempting to save/load with that dataset container type will fail no matter what. PyML is also inconsistent about whether labels should be numbers or strings: the oneAgainstRest function turns out not to be good enough, because the labels need to be strings that are simultaneously convertible to floats (the label is treated as a string in some places and converted to a float in others). After a great deal of hacking, I was finally able to figure out a way to save and load my multi-class classifier without it blowing up with an error. However, although it no longer gives an error message, it is still not quite right: the accuracy of the classifier drops significantly when it is saved and then reloaded, so I am still missing a piece of the puzzle.
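To make the label constraint concrete: every label that reaches PyML in this setup ends up being a string that must also survive a float() conversion. Here is a minimal, PyML-free sketch of the binarization used for each one-against-rest classifier (the helper name is my own, not part of PyML):

```python
def binarize_labels(labels, positive_class):
    """Map arbitrary class labels to the string labels '+1'/'-1'.

    The strings are chosen so that float(label) also works, since PyML
    treats labels as strings in some places and as floats in others.
    """
    return ['+1' if label == positive_class else '-1' for label in labels]

labels = ['cat', 'dog', 'cat', 'bird']
binary = binarize_labels(labels, 'cat')
# Every produced label is both a string and float-convertible.
assert all(isinstance(b, str) and float(b) in (1.0, -1.0) for b in binary)
```

This is exactly the convention the training code below follows when it builds currentlabels for each per-class dataset.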

I am currently using the following custom multi-class classifier for training, saving, and loading:

class SVM(object):
    def __init__(self,features_or_filename,labels=None,kernel=None):
        if isinstance(features_or_filename,str):
            filename=features_or_filename;
            if labels!=None:
                raise ValueError,"Labels must be None if loading from a file.";
            with open(os.path.join(filename,"uniquelabels.list"),"rb") as uniquelabelsfile:
                self.uniquelabels=sorted(list(set(pickle.load(uniquelabelsfile))));
                self.labeltoindex={};
                for idx,label in enumerate(self.uniquelabels):
                    self.labeltoindex[label]=idx;
            self.classifiers=[];
            for classidx, classname in enumerate(self.uniquelabels):
                self.classifiers.append(PyML.classifiers.svm.loadSVM(os.path.join(filename,str(classname)+".pyml.svm"),datasetClass = PyML.VectorDataSet));
        else:
            features=features_or_filename;
            if labels==None:
                raise ValueError,"Labels must not be None when training.";
            self.uniquelabels=sorted(list(set(labels)));
            self.labeltoindex={};
            for idx,label in enumerate(self.uniquelabels):
                self.labeltoindex[label]=idx;
            points = [[float(xij) for xij in xi] for xi in features];
            self.classifiers=[PyML.SVM(kernel) for label in self.uniquelabels];
            for i in xrange(len(self.uniquelabels)):
                currentlabel=self.uniquelabels[i];
                currentlabels=['+1' if k==currentlabel else '-1' for k in labels];
                currentdataset=PyML.VectorDataSet(points,L=currentlabels,positiveClass='+1');
                self.classifiers[i].train(currentdataset,saveSpace=False);

    def accuracy(self,pts,labels):
        logger=logging.getLogger("ml");
        correct=0;
        total=0;
        classindexes=[self.labeltoindex[label] for label in labels];
        h=self.hypotheses(pts);
        for idx in xrange(len(pts)):
            if h[idx]==classindexes[idx]:
                logger.info("RIGHT: Actual \"%s\" == Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ])); 
                correct+=1;
            else:
                logger.info("WRONG: Actual \"%s\" != Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ]))
            total+=1;
        return float(correct)/float(total);

    def prediction(self,pt):
        h=self.hypothesis(pt);
        if h!=None:
            return self.uniquelabels[h];
        return h;

    def predictions(self,pts):
        h=self.hypotheses(pts);
        return [self.uniquelabels[x] if x!=None else None for x in h];

    def hypothesis(self,pt):
        bestvalue=None;
        bestclass=None;
        dataset=PyML.VectorDataSet([pt]);
        for classidx, classifier in enumerate(self.classifiers):
            val=classifier.decisionFunc(dataset,0);
            if (bestvalue==None) or (val>bestvalue):
                bestvalue=val;
                bestclass=classidx;
        return bestclass;

    def hypotheses(self,pts):
        bestvalues=[None for pt in pts];
        bestclasses=[None for pt in pts];
        dataset=PyML.VectorDataSet(pts);
        for classidx, classifier in enumerate(self.classifiers):
            for ptidx in xrange(len(pts)):
                val=classifier.decisionFunc(dataset,ptidx);
                if (bestvalues[ptidx]==None) or (val>bestvalues[ptidx]):
                    bestvalues[ptidx]=val;
                    bestclasses[ptidx]=classidx;
        return bestclasses;

    def save(self,filename):
        if not os.path.exists(filename):
            os.makedirs(filename);
        with open(os.path.join(filename,"uniquelabels.list"),"wb") as uniquelabelsfile:
                pickle.dump(self.uniquelabels,uniquelabelsfile,pickle.HIGHEST_PROTOCOL);
        for classidx, classname in enumerate(self.uniquelabels):
            self.classifiers[classidx].save(os.path.join(filename,str(classname)+".pyml.svm"));

I am using the latest version of PyML (0.7.2, although PyML.__version__ is 0.7.0). When I construct the classifier with a training dataset, the reported accuracy is ~0.87. When I then save it and reload it, the accuracy is less than 0.001. So, there is something here that I am clearly not persisting correctly, although what that may be is completely non-obvious to me. Would you happen to know what that is?
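One way to narrow down what is failing to persist would be to compare the raw per-class decision values of the original and the reloaded classifiers on the same points: if they diverge, the per-class SVMs themselves lost state; if they match, the bug is more likely in the label bookkeeping (self.uniquelabels). Below is a minimal sketch of such a check; since it cannot assume a working PyML install, it uses hypothetical stand-in classifiers exposing decision(pt), where the real version would wrap classifier.decisionFunc(dataset, i):

```python
def max_decision_divergence(classifiers_a, classifiers_b, pts):
    """Return the largest absolute difference in decision values between
    two parallel lists of per-class classifiers over the same points."""
    worst = 0.0
    for clf_a, clf_b in zip(classifiers_a, classifiers_b):
        for pt in pts:
            diff = abs(clf_a.decision(pt) - clf_b.decision(pt))
            worst = max(worst, diff)
    return worst


class LinearStub(object):
    """Hypothetical stand-in classifier: decision value is a dot product."""
    def __init__(self, weights):
        self.weights = weights

    def decision(self, pt):
        return sum(w * x for w, x in zip(self.weights, pt))
```

If the divergence is near zero on a held-out sample yet predictions still differ after reloading, the problem is in how the class indices map back to labels rather than in the saved SVMs.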

© Stack Overflow or respective owner
