How Ideas2IT identifies companies using Machine Learning - Ideas2IT

In our conversations with clients and other like-minded people, we keep running into a problem that everyone faces, regardless of industry or business function: tracking companies on the internet. Some want to track competitors, some want to track their potential client base, and others want to track companies to partner with. The unifying theme is this – all of them want to track some company.

On the face of it, this might not seem like a huge problem. After all, there’s Google. Right? Wrong. Company names aren’t unique; they are often derived from words we use in everyday life. Take Apple: it is both one of the world’s most valuable technology companies and a well-known fruit. Companies are also referred to by monikers, colloquial names, and so on. All of this makes identifying them a non-trivial problem.

At Ideas2IT, we look at every problem as an opportunity. We believe in taking every chance we get to create solutions. And we always create those solutions using the latest cutting-edge technologies. Our answer to the current problem – Machine Learning.

How does Ideas2IT use Machine Learning?

We approached the problem so that we can track and identify companies in any text, using techniques that range from common word stems to greedy search matches. We were also careful to handle piles of unstructured content from various sources such as news, documents, press releases, and websites.

Figure: Ideas2IT’s Machine Learning framework

We’ve used both technical capabilities and common sense to construct a framework that we believe will satisfy the needs of every searcher. I’ve detailed the approach below. 

  1. Identifying company entities – For starters, we decided to bring in a named entity recognizer for company names. We built it by training a Conditional Random Field (CRF) classifier.
  2. Domain knowledge – To train the CRF on domain knowledge, we drew on a variety of real-world company dictionaries. For more specific targeting, we narrowed down the dictionaries to the specific areas that our target companies fall in.
  3. Dictionary matches – Once we had the required dictionaries, we built a common trick into the model itself: a feature that represents whether a token is part of a company name in one of the dictionaries. This helps the model identify company names in a given text.
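
To make the dictionary feature from step 3 concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the production feature set: the tiny `COMPANY_DICT`, the helper names, and the other features (`word.lower`, `word.istitle`, and so on) are all hypothetical. The output is one feature dictionary per token, which is the general shape that Python CRF wrappers such as sklearn-crfsuite accept as input.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dictionary of known company names, stored as lowercase
# token tuples. A real dictionary would be far larger and domain-specific.
COMPANY_DICT: Set[Tuple[str, ...]] = {
    ("apple",),
    ("ideas2it",),
    ("acme", "widgets"),
}


def dictionary_flags(tokens: List[str],
                     dictionary: Set[Tuple[str, ...]],
                     max_len: int = 3) -> List[bool]:
    """Mark every token that falls inside some dictionary entry."""
    lowered = [t.lower() for t in tokens]
    flags = [False] * len(tokens)
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            if tuple(lowered[start:end]) in dictionary:
                for i in range(start, end):
                    flags[i] = True
    return flags


def token_features(tokens: List[str], i: int,
                   flags: List[bool]) -> Dict[str, object]:
    """Build the feature dictionary for the token at position i."""
    word = tokens[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        # The dictionary-match feature described in step 3.
        "in_company_dict": flags[i],
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
        features["prev.in_company_dict"] = flags[i - 1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
        features["next.in_company_dict"] = flags[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    flags = dictionary_flags(tokens, COMPANY_DICT)
    return [token_features(tokens, i, flags) for i in range(len(tokens))]
```

Note that the flag alone cannot tell the company from the fruit: "apple" in "I ate an apple" is flagged too. That is why it is fed to the CRF as one feature among several rather than used as a hard rule; the model can weigh it against context such as capitalization and neighboring words.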

We recognize that the world keeps moving at a fast pace. So, we keep updating our dictionaries and retrain our models on the updated data. To avoid complex additions with a poor ROI, we have systems in place to ensure that each knowledge base update adds value rather than being a perfunctory one. For this, we also use a proprietary Baseline Configuration Logic along with a curated list of dictionaries. This is updated at a pre-set frequency and is subject to SLAs.

The brass tacks on Company Recognition

When identifying companies, it becomes important to classify the data and weed out the unimportant terms. For this, we use Named Entity Recognition (NER). NER is a sequence labeling task that aims to classify each word in a given text as belonging to a specific class. To construct our NER system, we use the CRFsuite framework.
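
To illustrate what "classifying each word" means in practice, here is a small sketch that turns per-token labels back into company names. It assumes the widely used BIO tagging scheme (`B-ORG` for the first token of a company name, `I-ORG` for its continuation, `O` for everything else). The label names and the `extract_companies` helper are assumptions for illustration; the actual label set is not specified here.

```python
from typing import List


def extract_companies(tokens: List[str], labels: List[str]) -> List[str]:
    """Collect contiguous B-ORG / I-ORG runs into company name strings."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    companies: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B-ORG":
            # A new name starts; close any name that was still open.
            if current:
                companies.append(" ".join(current))
            current = [token]
        elif label == "I-ORG" and current:
            # Continuation of the name that is currently open.
            current.append(token)
        else:
            # "O", or an I-ORG with no open name (treated like "O").
            if current:
                companies.append(" ".join(current))
            current = []
    if current:
        companies.append(" ".join(current))
    return companies


# The same word can carry different labels depending on context.
tokens = ["She", "ate", "an", "apple", "near", "the", "Apple", "store"]
labels = ["O", "O", "O", "O", "O", "O", "B-ORG", "O"]
print(extract_companies(tokens, labels))
```

In a full system, the `labels` list would come from the trained CRF rather than being written by hand.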

Company names acquired from web sources almost always contain noise, such as country names, legal forms, or other spurious terms. To overcome this problem, we cleanse the data and normalize the names to create a unique stem for each company. Then, we create dictionaries from these stems and narrow down the features we use for matching. When several dictionary entries could match, we always choose the longest possible match.
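
The normalization and longest-match steps can be sketched as follows. This is a simplified illustration, assuming a whitespace-and-punctuation tokenizer and very small noise lists; `LEGAL_FORMS`, `COUNTRY_TERMS`, and the function names are hypothetical, and the company names are made up.

```python
import re
from typing import Dict, List, Tuple

# Illustrative noise lists (assumptions); real lists would be much longer.
LEGAL_FORMS = {"inc", "llc", "ltd", "corp", "co", "gmbh", "plc",
               "limited", "corporation", "incorporated"}
COUNTRY_TERMS = {"usa", "uk", "india", "germany"}

Stem = Tuple[str, ...]


def normalize(name: str) -> Stem:
    """Reduce a raw company name to a stem: lowercase tokens minus noise."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    stem = [t for t in tokens
            if t not in LEGAL_FORMS and t not in COUNTRY_TERMS]
    # If every token was noise, keep the original tokens rather than nothing.
    return tuple(stem) if stem else tuple(tokens)


def build_dictionary(raw_names: List[str]) -> Dict[Stem, str]:
    """Map each unique stem to the first raw name that produced it."""
    dictionary: Dict[Stem, str] = {}
    for name in raw_names:
        dictionary.setdefault(normalize(name), name)
    return dictionary


def longest_matches(text: str, dictionary: Dict[Stem, str]) -> List[str]:
    """Scan left to right, always taking the longest dictionary match."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    max_len = max((len(stem) for stem in dictionary), default=0)
    matches: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in dictionary:
                matches.append(dictionary[candidate])
                i += length
                break
        else:
            i += 1  # no entry starts at this token
    return matches


companies = build_dictionary(
    ["Acme Inc.", "Acme Widgets Ltd (UK)", "ACME Widgets, Inc."]
)
print(longest_matches("Acme Widgets beat Acme on price", companies))
```

Here the two spellings of "Acme Widgets" collapse into one stem, and the text "Acme Widgets" resolves to that entry rather than to the shorter "Acme" because the longer candidate is tried first.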

How did Ideas2IT apply the solution?

We applied this solution to a product from our own stable – PipeCandy. PipeCandy is a market intelligence platform that tracks the global e-commerce landscape. A big part of its business is regularly gathering intelligence about e-commerce companies. As part of this, it needed to identify 750k companies by aggregating data from 100+ data sources. By applying our solution, we were able to quickly build a database that serves PipeCandy and its clients well.

Ideas2IT is a high-end product engineering firm with a mission of bringing ideas to life using the latest cutting-edge technologies. If you have a business problem, we have the solution. Reach out to us at sales@ideas2it.com