Enterprises log a wealth of valuable insights in their huge corpus of documents, and retrieving the right insight for the right context is key to unlocking that value. For instance, answering a question posted via a chatbot depends heavily on extracting the answer from the appropriate document.
Commonly applied full-text search often doesn't return compelling answers. In this blog, we will explore sophisticated NLP-based techniques for information retrieval.
Information retrieval can be achieved through both supervised and unsupervised techniques. What we have created is an unsupervised information retrieval system, DocSearch, which can be integrated into a chatbot as a service so that users can get information through natural-language queries.
Getting the Data
The data can come from multiple sources such as web repositories, PDFs, and Word or text documents. All the documents are scraped, preprocessed, and collated into a corpus in S3, which acts as a single source of information for model creation.
We used Latent Semantic Indexing (LSI) to find and rank the documents. LSI is an information retrieval method based on the distributional hypothesis, i.e., words that occur in the same contexts tend to have similar meanings. It produces concepts related to the documents, which we can use to find similarities between them.
Developing LSI Model
- Create a term-document matrix, where each cell (i, j) holds the number of times term i appears in document j. The matrix is large and sparse.
- On the term-document matrix, apply a weighting function. It transforms each cell into a weight that reflects both the frequency of the term in the document and the frequency of the term across the entire document collection. Here we used the log-entropy weighting function, which plays a role similar to TF-IDF.
- Perform a low-rank approximation to reduce the dimension of the matrix. In LSI, Singular Value Decomposition (SVD) is used for this.
- Decomposing X, an m x n matrix where m is the number of unique terms and n is the number of documents, gives three matrices: X = T S D^T, where T contains the term vectors, S is the diagonal matrix of singular values, and D contains the document vectors.
- The SVD is truncated to reduce the rank to k << r, where k is typically of the order of 100 to 300. In our implementation, we chose k as 200. Finally,
-> Tk is a matrix of term vectors
-> Sk is the truncated diagonal matrix of singular values
-> Dk is a matrix of document vectors
The document-topic matrix (Dk), which creates a document vector space, helps us find the documents relevant to the query.
- When a query comes in, its sparse term-frequency vector q is computed and the weighting function is applied. The sparse vector is then converted into a dense query vector Qk in the document vector space, using the equation:
Qk = q^T Tk Sk^-1
- The similarity between the query vector Qk and each document vector in Dk is calculated using cosine similarity.
- Based on these similarities, we rank the documents and show the user the best-matching documents.
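The steps above can be sketched end to end in NumPy. This is a minimal illustration, not the production system: the toy corpus, variable names, and k = 2 are all assumptions for the sake of a runnable example (the post uses k = 200 on a real corpus).

```python
import numpy as np

docs = [
    "replace the safety switch before servicing the motor",
    "the motor runs at a nominal voltage of 230 volt",
    "check the safety switch wiring and the fuse panel",
]

# 1. Term-document matrix: cell (i, j) = frequency of term i in document j.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# 2. Log-entropy weighting: local weight log(1 + tf) times a global
#    entropy weight per term over the whole collection.
p = X / X.sum(axis=1, keepdims=True)          # p_ij, rows sum to 1
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
g = 1.0 + plogp.sum(axis=1) / np.log(len(docs))  # global weight per term
W = np.log1p(X) * g[:, None]

# 3. Truncated SVD: W ~= Tk Sk Dk^T with rank k.
k = 2  # tiny because the toy corpus is tiny
T, s, Dt = np.linalg.svd(W, full_matrices=False)
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T  # Dk: one row per document

# 4. Fold a query into the document space: Qk = q^T Tk Sk^-1.
def fold_in(query):
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1
    return q @ Tk @ np.linalg.inv(Sk)

# 5. Rank documents by cosine similarity between Qk and the rows of Dk.
def rank(query):
    Qk = fold_in(query)
    sims = Dk @ Qk / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(Qk) + 1e-12)
    return np.argsort(-sims)  # document indices, best match first

print("ranking for 'safety switch':", rank("safety switch"))
```

Documents mentioning the query's terms in similar contexts should surface at the top of the ranking.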
- We have implemented domain-specific preprocessing to improve the quality of the document corpus for the model.
E.g. normalising 'Volt', 'voltages', and 'V' to 'volt'.
- Instead of a unigram model, we implemented a bigram model to increase relevancy. The bigram model converts unigram words into bigrams, which helps in finding documents containing the appropriate phrases.
In the unigram model, 'safety' and 'switch' would be separate search words, giving noisy results.
In the bigram model, 'safety switch' is a single search term, which produces relevant results.
- Query alteration – we implemented a generalized HMM model to correct errors in the query, such as spelling mistakes, merging errors, splitting errors, and word misuse.
- Query expansion – an external domain-specific synonym mapper expands the query with synonyms, which helps in retrieving more relevant documents.
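The normalisation, bigram, and query-expansion steps above can be sketched as follows. The mapping tables here are illustrative stand-ins for the domain-specific resources the post mentions, not the real ones:

```python
import re

# Domain normalisation, e.g. 'Volt', 'voltages', 'V' -> 'volt' (illustrative).
NORMALISE = {"v": "volt", "volts": "volt", "voltage": "volt", "voltages": "volt"}

# Stand-in for the external domain-specific synonym mapper.
SYNONYMS = {"switch": ["breaker"], "fuse": ["cutout"]}

def tokenize(text):
    # Lowercase, split on non-alphanumerics, then apply domain normalisation.
    return [NORMALISE.get(t, t) for t in re.findall(r"[a-z0-9]+", text.lower())]

def bigrams(tokens):
    # Convert unigrams to bigram terms so e.g. 'safety switch' is one search term.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def expand(tokens):
    # Append synonyms of each token to the query.
    out = list(tokens)
    for t in tokens:
        out.extend(SYNONYMS.get(t, []))
    return out

query = "Safety switch voltages"
print(expand(tokenize(query)))    # ['safety', 'switch', 'volt', 'breaker']
print(bigrams(tokenize(query)))   # ['safety_switch', 'switch_volt']
```

In practice the same normalisation and bigram conversion must be applied to both the corpus at indexing time and the query at search time, so that the two share one vocabulary.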
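The post's query alteration uses a generalized HMM; as a much simpler, self-contained illustration of the same error classes, the sketch below corrects spelling and merging errors against a small vocabulary using edit-distance matching. The vocabulary and cutoff are illustrative assumptions:

```python
from difflib import get_close_matches

# Illustrative domain vocabulary; the real system would use the corpus lexicon.
VOCAB = {"safety", "switch", "voltage", "motor", "fuse", "panel"}

def correct_token(token):
    if token in VOCAB:
        return [token]
    # Merging error: try splitting into two known words, e.g. 'safetyswitch'.
    for i in range(1, len(token)):
        if token[:i] in VOCAB and token[i:] in VOCAB:
            return [token[:i], token[i:]]
    # Spelling error: snap to the closest known word, if any is close enough.
    match = get_close_matches(token, VOCAB, n=1, cutoff=0.75)
    return [match[0]] if match else [token]

def correct_query(query):
    out = []
    for t in query.lower().split():
        out.extend(correct_token(t))
    return " ".join(out)

print(correct_query("safetyswitch voltge"))  # -> 'safety switch voltage'
```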
Whenever a new document is uploaded, the model is updated with it, and the entire flow has been automated.