SciVerify

Using NLP to Improve Fund Allocations

In partnership with Duke University's Office of Research & Innovation, this product helps prevent funds from being allocated to research proposals that overlap too closely with projects previously carried out at the institution.

To achieve this, the product uses an ensemble of Natural Language Processing (NLP) models to parse a database of thousands of lengthy documents and rank them by similarity to the newly submitted proposal. For the top-ranked documents, it also highlights the specific sentence pairs across documents with the highest content similarity. With this shortlist, staff at the OR&I only need to review the top few contenders to determine whether an incoming proposal overlaps significantly with a preexisting research project.

Additionally, users can give feedback on each of the top 5 results; this feedback is then used to fine-tune the ensemble of machine learning models that make up the search engine.
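The feedback loop described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names and the flat log structure are assumptions made for this example.

```python
# Hypothetical sketch of the feedback loop: each thumbs-up/down on a
# top-5 result becomes a labeled pair that can later be used to
# fine-tune the similarity models.
feedback_log = []

def record_feedback(proposal_id, result_id, relevant):
    """Store one user judgment as a labeled (query, candidate) pair."""
    feedback_log.append({"query": proposal_id,
                         "candidate": result_id,
                         "label": 1 if relevant else 0})

def training_pairs():
    """Return (query, candidate, label) tuples for model fine-tuning."""
    return [(f["query"], f["candidate"], f["label"]) for f in feedback_log]
```

Accumulated pairs like these are the standard input format for fine-tuning a similarity model on relevance judgments.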

Duration

May 2022 - July 2022

Team Size

6

Technologies

Python, Jupyter, Machine Learning, Natural Language Processing

Technical Details

All of the code for this project was written in Python, drawing on a variety of libraries for the NLP models and the surrounding features. Because of the large-scale AI models involved, development took place on GPU cores in Duke University's virtual machine system. All in all, this was a highly experimental project, so the final processing pipeline was selected through extensive testing of both runtime and accuracy.

When the grant proposal database is first parsed (which takes about 5 minutes on average), each document is condensed into a few paragraphs containing keywords that indicate they describe the substance of the research (to avoid flagging documents merely for sharing general methodologies, for example). These paragraphs are then fed into a transformer model, which generates a 5-sentence summary of their content.
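The keyword-based condensing step can be sketched in a few lines. The keyword set and function name here are illustrative assumptions; the real system's keyword list is not reproduced here.

```python
# Hypothetical sketch of the condensing step: keep only paragraphs that
# mention research-content keywords, so that boilerplate and general
# methodology sections are dropped before summarization.
RESEARCH_KEYWORDS = {"hypothesis", "objective", "aims", "investigate", "novel"}

def condense(paragraphs, keywords=RESEARCH_KEYWORDS):
    """Return only the paragraphs containing at least one keyword."""
    kept = []
    for paragraph in paragraphs:
        words = {w.strip(".,;:").lower() for w in paragraph.split()}
        if words & keywords:
            kept.append(paragraph)
    return kept
```

The condensed paragraphs, rather than the full document, are what the summarization model then sees, which keeps its input within a manageable length.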

Using the same kind of summary for the new document, another NLP model compares it against the database content and surfaces the most similar entries. Once the top 5 is finalized, yet another NLP model compares just those few full documents to the incoming one, flagging the specific pairs of sentences with the most similar content.
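The ranking and sentence-pair flagging steps can be illustrated with a simplified stand-in. Here a plain bag-of-words cosine similarity replaces the actual transformer-based models; the function names and the similarity measure are assumptions made purely to show the shape of the pipeline.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts, using bag-of-words counts
    as a stand-in for the real models' embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def top_k(proposal_summary, database_summaries, k=5):
    """Rank stored summaries by similarity to the incoming proposal."""
    return sorted(database_summaries,
                  key=lambda d: cosine(proposal_summary, d),
                  reverse=True)[:k]

def flag_sentence_pairs(sentences_a, sentences_b, threshold=0.5):
    """Return cross-document sentence pairs scoring above the threshold."""
    return [(sa, sb)
            for sa in sentences_a
            for sb in sentences_b
            if cosine(sa, sb) > threshold]
```

Restricting the expensive sentence-by-sentence comparison to the top 5 documents is what keeps the full pipeline's runtime manageable.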

The user interface was developed using Voila, a tool that converts Jupyter notebooks into user-friendly, interactive HTML pages.

My Role

I took on a prominent leadership role on this team, often delegating tasks and encouraging sharing and collaboration between team members.

On the development side, I was largely responsible for creating and testing the data processing pipeline. I led most of the exploration and fine-tuning of the summarization models and the sentence-to-sentence comparison, taking particular care that their logic neither skewed the results nor triggered unnecessary computations that would stretch executions to hours.

Video Demo

Short demonstration of SciVerify's main functionality: uploading a grant proposal and viewing the similarity results.

Photo Gallery