“I was asked to help sell machine learning to a bank. They asked us to have a look at predicting failures of their payment system. They had previous attempts with various technologies and various consulting companies, but they all failed. Training data would be transactions processed in the past.
My dialog with the client team went like this:
– (me) Do we know which transactions failed among all the training data transactions?
– Yes, we do.
– (me) Do we have enough data?
– Yes, we have millions in our training data set.
– Perfect, can I get access to data?
It took a while to get access because of security concerns, as is often the case in regulated industries like banking, but I finally got access to a file of a million transactions. I looked at it: there were hundreds of features, but I could not find the target values (whether a given transaction was considered a failure or not). When I asked about it, I was pointed to a second file where the target values were. True, the values were there, and all I needed to do was join it with the other data set.
There was an issue though: the target file contained only 1,700 transactions or so, and only 8 failures among them. No wonder no machine learning approach had worked: how can you learn from only 8 examples?
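The situation described above — joining a large feature file with a tiny label file, then checking class balance — can be sketched as follows. All identifiers and values here are illustrative stand-ins, not the bank's actual data:

```python
from collections import Counter

# Hypothetical data: many feature rows keyed by transaction id,
# but target values for only ~1,700 transactions, 8 of them failures.
features = {txn_id: {"amount": txn_id * 1.5} for txn_id in range(100_000)}
labels = {txn_id: (1 if txn_id < 8 else 0) for txn_id in range(1_700)}

# Join: keep only the transactions that actually have a target value.
labeled = {txn: (features[txn], y) for txn, y in labels.items() if txn in features}

# Always check class balance before any modeling.
counts = Counter(y for _, y in labeled.values())
print(len(labeled), counts[1])  # prints: 1700 8
```

The join itself is trivial; the point is the last line, which is the check any team should run before deciding whether the problem is learnable at all.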
I suggested they spend time labeling other transactions. Assuming 10 transaction labels per minute, a one-person-week effort would yield more than 10 times as many examples to learn from.
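The back-of-the-envelope estimate above works out as follows, assuming a standard 40-hour work week:

```python
labels_per_minute = 10
minutes_per_week = 60 * 40  # one person, one 40-hour week

new_labels = labels_per_minute * minutes_per_week
print(new_labels)                 # prints: 24000
print(new_labels / 1_700 > 10)    # prints: True — over 10x the existing 1,700 labels
```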
Yet, the client team refused to do that, and guess what, they still have no success with machine learning.” Read more at this Jean-Francois Puget post.