Capturing the Financial Facts

So far, we have looked at the data mining side of analyzing the financial markets and at some of the problems that arise during such an analysis: data have to be collected and pre-processed accordingly. There is a danger of over-fitting, and the analyst must make sure that the model(s) created have the expected quality. The analyst also has to choose the relevant attributes with which the analysis will be performed and how the training of the algorithms …

The markets react to financial news; there is no question about this. Of course, other factors also make people buy or sell. For example, if a stock price hits a support or resistance level, some investors are going to buy or sell at that price level. Investors are also going to buy or sell when specific technical indicators, such as MACD or oscillators, give the signal to do so. Even when bad news is out, markets will go up by an unknown percentage after an unknown number of consecutive drops, and vice versa.

People involved with Machine Learning know that the representation of the problem at hand is of high importance, so first we are going to look at ways in which financial news can be represented in a helpful way.

First we have to see what we are dealing with. To do this, we have to analyze and categorize financial information as it is created. Financial news can be about a number of things:

1) The number of jobless claims in the US is higher than last year.
2) Automotive company XYZ's sales dropped by 15%.
3) Oil prices hit yet another record high.
4) The dollar is dropping.

…and the list goes on.

So the first problem arises: should we categorize the information according to its content and present it to the algorithms? We could do that by having a boolean field for each type of news in our training file and setting it to TRUE or FALSE accordingly. With this method we could easily reach thousands of input fields, since for the “jobless claims” news type alone we could have the following variants:

-A specific country for the jobless claims report (not only the US, it could be any country)

-Jobless claims could be higher than expected, higher than last year, or the highest in the last decade.
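
To get a feel for how quickly the boolean fields multiply, here is a minimal Python sketch. The country list, the comparison labels and the field naming scheme are all made up for illustration; they are assumptions, not part of any real training file.

```python
from itertools import product

# Hypothetical vocabularies, chosen only for illustration.
COUNTRIES = ["US", "UK", "DE", "JP"]
COMPARISONS = ["higher_than_expected", "higher_than_last_year", "highest_in_decade"]

def jobless_claim_fields(countries, comparisons):
    """Return one boolean field name per (country, comparison) variant."""
    return [f"jobless_claims_{c}_{k}" for c, k in product(countries, comparisons)]

def encode(active_fields, all_fields):
    """Build one training row: True for fields present in the news, False otherwise."""
    active = set(active_fields)
    return {name: name in active for name in all_fields}

fields = jobless_claim_fields(COUNTRIES, COMPARISONS)
row = encode(["jobless_claims_US_higher_than_last_year"], fields)

print(len(fields))        # 4 countries x 3 comparisons = 12 fields
print(sum(row.values()))  # only one field is True in this row
```

Four countries and three comparisons already give 12 fields for a single news type, and almost every row is FALSE everywhere. With every country and every news type included, the count grows multiplicatively.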

It is easy to see that this gets out of control very quickly. Perhaps a better solution would be to create clusters of (more or less) similar news items. Clustering the financial news might seem an interesting idea: an analyst could define a number of clusters, say 100, and let the clustering process categorize all the news accordingly. But is clustering the solution? More on this in the next post…
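
In the meantime, a rough idea of what "grouping more or less the same news" could look like in Python. This is only a sketch under strong assumptions: headlines are compared by word overlap (Jaccard similarity), and the grouping is a simple greedy pass with an arbitrary threshold of 0.4. It is a stand-in for a real clustering algorithm with a fixed number of clusters, and the function names and sample headlines are hypothetical.

```python
def tokens(headline):
    """Lower-case a headline and split it into a set of words."""
    return set(headline.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_clusters(headlines, threshold=0.4):
    """Put each headline into the first cluster whose seed headline is
    similar enough; otherwise start a new cluster with it as the seed."""
    clusters = []  # list of (seed_tokens, member_headlines)
    for h in headlines:
        t = tokens(h)
        for seed, members in clusters:
            if jaccard(t, seed) >= threshold:
                members.append(h)
                break
        else:
            clusters.append((t, [h]))
    return [members for _, members in clusters]

# Made-up headlines, for illustration only.
news = [
    "US jobless claims higher than last year",
    "UK jobless claims higher than expected",
    "Oil prices hit another record high",
    "Oil prices hit record high again",
]

for group in greedy_clusters(news):
    print(group)
```

Here the two jobless claims headlines end up together and the two oil headlines end up together, without any hand-written field per variant. The result depends heavily on the threshold and on the order of the headlines, which is one reason to be cautious about it.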
