We have already discussed feature selection and feature reduction, and all the techniques we explored were strictly related and applicable only in the text mining domain.
The method presented here is known as the coefficient of constraint (Coombs, Dawes and Tversky 1970) or uncertainty coefficient (Press & Flannery 1988), and it is based on the concept of Mutual Information.
Mutual Information
As you know, the Mutual Information I(X,Y) between two random variables X and Y measures how much the uncertainty (entropy) of X is reduced by knowing Y.
This is clearer looking at the formula below.
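In standard notation, writing H(X) for the entropy of X and H(X|Y) for the conditional entropy of X given Y, the textbook definition is:

$$ I(X;Y) \;=\; H(X) - H(X \mid Y) \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} $$

A naive idea would be to rank the features directly by this value, but a high Mutual Information alone is not enough, as the following example shows.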
Consider, as a simple example, a set of people split into a red group and a blue group, where every person has a unique Social Security Number (SSN):
- If I use the SSN to discriminate the people belonging to the red set from the people belonging to the blue set, I can achieve 100% accuracy, because the classifier will not find any overlap between different people.
- However, using the SSN as a predictor on a new data set, never seen before by the classifier, gives catastrophic results!
- The entropy of such a variable is extremely high, because it is almost uniformly distributed!
The key point is: the SSN variable can have a very high mutual information value, but it is dramatically useless for the classification job.
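This is exactly the weakness the uncertainty coefficient is meant to fix: instead of the raw mutual information, the score is normalized by an entropy term. Assuming, as the SSN example suggests, that the feature's own entropy goes in the denominator, the coefficient is

$$ U(X \mid Y) \;=\; \frac{I(X;Y)}{H(X)} \;=\; \frac{H(X) - H(X \mid Y)}{H(X)} $$

so a near-unique variable such as the SSN, whose entropy is huge while its mutual information with the class can never exceed the class entropy, gets a very low score. Below is a minimal, purely illustrative sketch of this effect on synthetic data (the variable names and numbers are illustrative assumptions, not taken from the experiment):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Shannon entropy (in nats) of a discrete sample."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def uncertainty_coefficient(x, y):
    """Mutual information I(x; y) normalized by the entropy of the feature x."""
    h = entropy(x)
    return mutual_info_score(x, y) / h if h > 0 else 0.0

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)       # two balanced classes (red / blue)
ssn = np.arange(n)                   # unique identifier: one value per person
# a genuinely informative boolean feature: present in 80% of one class, 20% of the other
word = (rng.random(n) < np.where(y == 1, 0.8, 0.2)).astype(int)

for name, feat in [("ssn", ssn), ("word", word)]:
    print(name,
          "I =", round(mutual_info_score(feat, y), 3),
          "U =", round(uncertainty_coefficient(feat, y), 3))
# The SSN reaches the maximum possible I (it equals the class entropy), yet its
# uncertainty coefficient is tiny because H(ssn) = log(1000) is huge; the boolean
# word feature ends up ranked above it.
```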
Here is the experiment:
- I chose 20% of the docs as the training set.
- From this training set I extracted (after the stemming and filtering process) all the words and used them to build the boolean vectors.
- I ranked the words through the uncertainty coefficient.
- I extracted the first 60 features: that is only 0.38% of the original feature space.
- I trained an SVM with a Gaussian kernel and a very high value of C.
- I tested over the remaining 80% of the data set (a sketch of this pipeline follows the list).
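The following is a rough scikit-learn sketch of that pipeline. The corpus (a two-class subset of 20 Newsgroups), the stop-word filtering (standing in for the stemming and filtering step) and the exact value of C are illustrative assumptions, not the setup used in the experiment:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def uncertainty_coefficient(x, y):
    """I(x; y) normalized by the entropy of the feature x (both in nats)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    h = float(-np.sum(p * np.log(p)))
    return mutual_info_score(x, y) / h if h > 0 else 0.0

# Stand-in two-class corpus (the experiment's own data set is not shown here).
data = fetch_20newsgroups(subset="all", categories=["sci.med", "sci.space"],
                          remove=("headers", "footers", "quotes"))

# 20% of the docs for training, the remaining 80% held out for the test.
train_txt, test_txt, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.2, stratify=data.target, random_state=0)

# Boolean (presence/absence) word vectors; stop-word removal is a crude proxy
# for the stemming + filtering step of the original pipeline.
vec = CountVectorizer(binary=True, stop_words="english")
X_train = vec.fit_transform(train_txt).toarray()

# Rank every word by its uncertainty coefficient against the class label
# and keep only the 60 best-ranked features.
scores = np.array([uncertainty_coefficient(X_train[:, j], y_train)
                   for j in range(X_train.shape[1])])
top = np.argsort(scores)[::-1][:60]

# Gaussian (RBF) kernel SVM with a very high C (the value here is arbitrary).
clf = SVC(kernel="rbf", C=1e4)
clf.fit(X_train[:, top], y_train)

X_test = vec.transform(test_txt)[:, top].toarray()
print("accuracy on the 80% test split:", accuracy_score(y_test, clf.predict(X_test)))
```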
Before the results, let me show you some graphs about the feature ranking.
*Figure: Entropy of the first 3000 features.*
The above graph shows the entropy of the first 3000 features sorted by TF-DF score.
As you can notice, the features with a low score have low entropy: this happens because these features are present in very few documents, so their distribution follows a Bernoulli distribution with "p" almost equal to 0; basically, the uncertainty of the variable is very small.
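For a boolean word feature present in a fraction p of the documents, this is the usual Bernoulli entropy:

$$ H(p) \;=\; -p\,\log_2 p \;-\; (1-p)\,\log_2(1-p) $$

which tends to 0 as p approaches 0: a word appearing in 1% of the documents carries about 0.08 bits, against the 1-bit maximum reached at p = 0.5.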
Here is the final ranking of the first 2500 features.
*Figure: Uncertainty coefficient for all the features.*
*Figure: Accuracy comparison: the red circle represents the accuracy obtained training an SVM with the features extracted through the uncertainty coefficient.*