Machine Learning Datasets: Build Or Buy?


    IFI CLAIMS Patent Services has a global patent database with more than 110 million records from about 100 countries, which the company has painstakingly assembled over the years. “We take information from different data sources and we standardize it and put it in a usable format that companies can either access directly or they can build a user interface on top of it,” Director of Marketing Catherine Suski said. Could the company have acquired this comprehensive dataset instead? Hardly, according to Suski. There is nothing like it in the world, she said. Furthermore, the company is continually adding to it and monitoring it for quality. IFI CLAIMS says it runs checks to make sure the data is correct, which involves a lot of manual work by editors. “There are more than 2,000 variations on the name IBM,” she said. “Our database standardizes all these variations to make it easier for customers.”
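    To make the idea of standardizing name variations concrete, here is a minimal sketch in Python. It is not IFI CLAIMS's actual method, which the article does not describe beyond noting that editors do a lot of manual work. The alias table, its entries, and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical alias table: maps a cleaned-up name variant to one canonical name.
# A production table would be curated by editors; these entries are illustrative.
CANONICAL_NAMES = {
    "ibm": "IBM",
    "ibm corp": "IBM",
    "international business machines": "IBM",
    "international business machines corp": "IBM",
    "international business machines corporation": "IBM",
}


def normalize(raw_name: str) -> str:
    """Lowercase the name, drop punctuation, and collapse runs of whitespace."""
    lowered = raw_name.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def standardize(raw_name: str) -> str:
    """Return the canonical name for a known variant, else the input unchanged."""
    return CANONICAL_NAMES.get(normalize(raw_name), raw_name)


if __name__ == "__main__":
    for variant in ["I.B.M. Corp.", "International Business Machines", "Acme Widgets"]:
        print(variant, "->", standardize(variant))
```

    Unknown names pass through untouched, so an editor can review them and extend the table. A simple lookup like this covers only the variants someone has already catalogued, which is one reason the manual work Suski describes is hard to avoid.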

    In our automated and connected economy, data has become the coin of the realm. Companies need it to market to their customers, to develop their products, to make corporate decisions, and to build and test their own apps. It is ubiquitous, or so it would seem. In truth, the right dataset can be as valuable as any hard asset that a company might own. “As much as we all feel inundated with information overload, ask any machine learning researcher and they will tell you that there isn’t enough data to learn,” said Mansi Singhal, CEO of qplum. Essentially, a company has three choices: it can build a dataset, as IFI CLAIMS did; it can buy one; or it can do a little of both.

    Whatever the company decides, Singhal said, it will not be cheap. “This is something that many other firms are facing as the big hurdle: how to create large datasets economically so that models can be trained and answers can be more accurate and high quality,” she said. But there is more than just cost to consider. Indeed, a company has to weigh several, sometimes competing, factors as it makes a decision about the data it will use. It’s rarely a pure IT decision, that is, a matter of choosing the most comprehensive dataset available. “It’s always a game of maximizing value,” said Natalie Robb, founder of Wavelength Analytics. “Like anything else, determining which dataset is best is based on trade-offs,” she said. In other words, the “best” dataset is one that fits the budget, meets the project’s data quantity and quality needs, and can be delivered within its time constraints.