What if your computer could find ideas in documents? Building on the idea of fingerprinting documents, ai-one helped develop ai-BrainDocs – a tool to mine large sets of documents to find ideas using intelligent agents. This solves a big problem for knowledge workers: How to find ideas in documents that are missed by traditional keyword search tools (such as Google, Lucene, Solr, FAST, etc.).
Customers Struggle with Unstructured Text
Almost every organization struggles to find value in “big data” – especially ideas buried within unstructured text. Often a very limited set of vocabulary can be used to express very different ideas. Lawyers are particularly talented at this: They can use 100 unique words to express thousands of ideas by simply changing the ordering and frequencies of the words.
Lawyers are not the only ones who need to find ideas inside documents. Other use cases include finding and classifying complaints, identifying concepts within social media feeds such as Twitter or Facebook, and mining PubMed to find related research articles. Recently, we have had several healthcare companies contact us to mine electronic health records (EHR) data to find information that is buried within doctors' notes so they can predict adverse reactions, find co-morbidity risks and detect fraud.
The common denominator for all these use cases is simple: How to find “what matters most” in documents? These organizations need a way to find ideas fast enough to keep pace with the growth in documents. With information growing at almost 20% per year, a very big problem now will be enormous next year.
Problems with Current Approaches
We’ve heard numerous stories from customers who were frustrated at the cost, complexity and expertise required to implement solutions that enable machines to read and understand the meaning of free-form text. Often these solutions use latent semantic indexing (LSI) and latent Dirichlet allocation (LDA). In one case, a customer spent more than two years trying to combine LSI with a Microsoft FAST Enterprise search appliance running on SharePoint. It failed because they were searching a high volume of legal documents with very low variability. They were searching legal contracts to find paragraphs that included a very specific legal concept that could be expressed with many different combinations of words. Keyword search failed because the legal concept was expressed with commonly used words. LSI and LDA failed because the systems required a very large training set – often involving hundreds of documents. Even after reducing the specificity requirements, LSI and LDA still failed because they could not find the legal ideas at the paragraph level.
We found inspiration in the complaints we heard from customers: What if we could build an “intelligent agent” that could read documents like a person? We thought of the agent as an entry-level staff person who could be taught with a few examples then highlight paragraphs that were similar to (but not exactly like) the teaching examples.
Solution: Building Intelligent Agents
For several months, we have been developing prototypes of intelligent agents to mine unstructured text to find meaning. We built a Java application that combines ai-one’s machine learning API with natural language processing (OpenNLP) and NoSQL databases (MongoDB). Our approach generates an “ai-Fingerprint” that is a representational model of a document using keywords and association words. The “ai-Fingerprint” is similar to a graph G[V,E] where G is the knowledge representation, V (vertices) are keywords, and E (edges) are associations. This can also be thought of as a topic model.
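To make the graph idea concrete, here is a minimal Python sketch of a keyword-and-association graph. It is not ai-one’s API or algorithm: it assumes that keywords are simply the most frequent non-stopwords and that an association is any word sharing a sentence with a keyword. The names `build_fingerprint`, `tokenize` and `STOPWORDS` are hypothetical, chosen for illustration only.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "is", "be",
             "by", "with", "without", "any", "may", "not", "shall", "must"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_fingerprint(text, top_k=5):
    """Return a toy graph: vertices are keywords, edges are weighted associations.

    Assumption: keywords are the top_k most frequent tokens, and an edge links
    two words that share a sentence when at least one of them is a keyword.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    token_lists = [tokenize(s) for s in sentences]
    counts = Counter(w for tokens in token_lists for w in tokens)
    keywords = [w for w, _ in counts.most_common(top_k)]
    keyword_set = set(keywords)
    edges = Counter()
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            if a in keyword_set or b in keyword_set:
                edges[(a, b)] += 1  # edge weight = number of shared sentences
    return {"vertices": keywords, "edges": dict(edges)}

if __name__ == "__main__":
    sample = ("The tenant shall obtain written consent. "
              "Written consent must be given by the landlord. "
              "The landlord may not withhold consent.")
    fingerprint = build_fingerprint(sample, top_k=3)
    print(fingerprint["vertices"])
    print(fingerprint["edges"])
```

Even this crude version shows why the graph carries more information than a keyword list: the weight on the edge between “consent” and “written” records how often the two words appear together, not just that each one occurs.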
The ai-Fingerprint can be generated for almost any size text – from sentences to entire libraries of documents. As you might expect, the “intelligence” (or richness) of the ai-Fingerprint is proportional to the size of text it represents. Very sparse text (such as a tweet) has very little meaning. Large texts, such as legal documents, are very rich. This approach to topic modelling is precise – even without training or using external ontologies.
[NOTE: We are experimenting with using ontologies (such as OWL and RDF) as a way to enrich ai-Fingerprints with more intelligence. We are eager to find customers who want to build prototypes using this approach.]
The Secret Sauce
The magic is that ai-one’s API automatically detects keywords and associations – so it learns faster, with fewer documents and provides a more precise solution than mainstream machine learning methods using latent semantic analysis. Moreover, using ai-one’s approach makes it relatively easy for almost any developer to build intelligent agents.
How to Build Intelligent Agents?
To build an intelligent agent, we first had to consider how a human reads and understands a document.
The Human Perspective
Humans are very good at detecting ideas – regardless of the words used to express them. As mentioned above, lawyers can express dozens of completely different legal concepts with a vocabulary of just a few hundred words. Humans can recognize the subtle differences between two paragraphs by how a lawyer uses words – both in meaning (semantics) and structure (syntax). Part of the cleverness of a lawyer is finding ways to combine as few words as possible to express a very precise idea to accomplish a specific legal or business objective. In legal documents, each new idea is almost always expressed in a paragraph. So two paragraphs might have the exact same words but express completely different ideas.
To find these ideas, a person (or computer) must detect the patterns of word use – similar to finding a pattern in a signal. For example, as a child I knew I was in trouble when my mother called me by my first and last name – the combination of these words created a “signal” that was different than when she just used my first name. Similarly, a legal concept has a different meaning if two words occur together, such as “written consent,” than if it only uses the word “consent.”
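The “written consent” example can be illustrated with a few lines of Python. This is only a sketch of the general idea of treating word pairs as a signal, under the assumption that adjacent-word counts are a reasonable stand-in for word-use patterns. The function names `bigram_counts` and `signal_strength` are hypothetical.

```python
import re
from collections import Counter

def words_of(text):
    """Lowercase alphabetic tokens, in order of appearance."""
    return re.findall(r"[a-z]+", text.lower())

def bigram_counts(text):
    """Count adjacent word pairs. Pairs may span sentence boundaries here."""
    words = words_of(text)
    return Counter(zip(words, words[1:]))

def signal_strength(text, phrase):
    """Return (times the two-word phrase occurs, times its second word occurs)."""
    first, second = phrase.lower().split()
    pair_count = bigram_counts(text)[(first, second)]
    word_count = words_of(text).count(second)
    return pair_count, word_count

if __name__ == "__main__":
    clause = ("Assignment requires written consent. "
              "Consent may be withheld. "
              "Written consent is final.")
    pair, single = signal_strength(clause, "written consent")
    print(f"'written consent' occurs {pair} times; 'consent' occurs {single} times")
```

A keyword search for “consent” treats every occurrence alike; counting the pair separately is the simplest way to keep the two signals apart.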
The (Conventional) Machine Learning Perspective
It’s almost impossible to program a computer to find such “faint signals” within a large number of documents. To do so would require a computer to be programmed to find all possible combinations of words for a given idea to search and match.
Machine learning technologies enable computers to identify features within the data to detect patterns. The computer “learns” by recognizing the combinations of features as patterns.
[There are many forms of machine learning – so I will keep focused only on those related to our text analytics problem.]
Natural Language Processing
One of the most important forms of machine learning for text analytics is natural language processing (NLP). NLP tools are very good at codifying the rules of language for computers to detect linguistic features – such as parts of speech, named entities, etc.
However (at the time of this writing), most NLP systems can’t detect patterns unless they are explicitly programmed or trained to do so. Linguistic patterns are very domain specific. The language used in medicine is different than what is used in law, etc. Thus, NLP is not easily generalized. NLP only works in specific situations where there is predictable syntax, semantics and context. IBM Watson can play Jeopardy! but has had tremendous problems finding commercial applications in marketing or medical records processing. Very few organizations have the budget or expertise to train NLP systems. They are left to either buy an off-the-shelf solution (such as StoredIQ) or hire a team of PhDs to modify one of the open-source NLP tools. Good luck.
Latent Analysis Techniques
Tools such as latent semantic analysis (LSA), latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) are all capable of detecting patterns within language. However, they require tremendous expertise to implement and often require large numbers of training documents. LSA and LSI are computationally expensive because they must recalculate the relationships between features each time they are given something new to learn. Thus, learning the meaning of the 1,001st document requires a calculation across the 1,000 previously learned documents. LSA uses a statistical approach called singular value decomposition (SVD) to isolate keywords. Unlike LSA, ai-one’s technology also detects the association words that give a keyword context.
Similar to our ai-Fingerprint approach, LDA uses a graphical model for topic discovery. However, it takes tremendous skill to develop applications using LDA. Even when implemented, it requires the user to make informed guesses about the nature of the text. Unlike LDA, ai-one’s technology can be learned in a few hours. It requires no supervision or human interaction. It simply detects the inherent semantic value of text – regardless of language.
Our First Intelligent Agent Prototype: ai-BrainDocs
It took our team about a month to build the initial version of ai-BrainDocs. Our team used ai-one’s keyword and association commands to generate a graph for each document. This graph goes into MongoDB as a JSON object that represents the knowledge (content) of each document.
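As a rough illustration of what such a JSON object might look like, here is a Python sketch that turns a keyword-and-association graph into a document suitable for a store like MongoDB. The field names (`_id`, `keywords`, `associations`, `weight`) and the function `fingerprint_to_document` are assumptions made for this example; the actual ai-BrainDocs schema is not described here.

```python
import json

def fingerprint_to_document(doc_id, keywords, associations):
    """Convert a toy graph into a JSON-serializable dict.

    keywords: iterable of keyword strings (the vertices).
    associations: dict mapping (keyword, word) tuples to a weight (the edges).
    JSON object keys must be strings, so edges are stored as a list of objects.
    """
    return {
        "_id": doc_id,
        "keywords": sorted(keywords),
        "associations": [
            {"keyword": k, "word": w, "weight": weight}
            for (k, w), weight in sorted(associations.items())
        ],
    }

if __name__ == "__main__":
    doc = fingerprint_to_document(
        "contract-001",  # hypothetical document id
        {"consent", "landlord"},
        {("consent", "written"): 2, ("landlord", "consent"): 1},
    )
    print(json.dumps(doc, indent=2))
    # With a MongoDB driver such as pymongo, a dict like this could be
    # passed to a collection's insert_one() method.
```

Storing edges as a list of small objects rather than nested keys keeps each association queryable on its own, which matters once agents start comparing many documents.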
Next we created an easy way to build intelligent agents. We simply provide the API with examples of concepts we want to find. This training set can be very short. For one type of legal contract, it took only four examples of text for the intelligent agent to achieve 90% accuracy in finding similar concepts.
Unlike solutions that use LSI, LDA and other technologies, the intelligent agents in ai-BrainDocs find ideas at the paragraph level. This is a huge advantage when looking at large documents – such as medical research or SEC filings.
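The teach-by-example workflow can be sketched in Python as follows. This is not how ai-one’s API works internally: as a stand-in, the sketch assumes a plain bag-of-words profile and cosine similarity, and the `Agent` class, its methods and the `threshold` parameter are hypothetical names.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class Agent:
    """Toy agent taught with a handful of example paragraphs."""

    def __init__(self, examples):
        self.profile = Counter()
        for example in examples:
            self.profile.update(vectorize(example))

    def score(self, paragraph):
        return cosine(self.profile, vectorize(paragraph))

    def find(self, paragraphs, threshold=0.5):
        """Return indices of paragraphs that score at or above the threshold."""
        return [i for i, p in enumerate(paragraphs) if self.score(p) >= threshold]

if __name__ == "__main__":
    agent = Agent([
        "Tenant must obtain written consent of the landlord.",
        "No assignment without prior written consent.",
    ])
    paragraphs = [
        "The tenant shall not assign without written consent of the landlord.",
        "Rent is due on the first day of each month.",
    ]
    for i, p in enumerate(paragraphs):
        print(i, round(agent.score(p), 3))
    print(agent.find(paragraphs, threshold=0.5))
```

The point of the sketch is the shape of the workflow: a few examples go in, a similarity score per paragraph comes out, and a threshold decides what gets highlighted. A bag-of-words profile like this one ignores word order and associations, which is exactly the weakness the ai-Fingerprint approach is meant to address.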
Next we built an interface that allows the end-user to control the intelligent agents by setting thresholds for sensitivity and determining how many paragraphs to scan at a time.
Our first customers are now testing ai-BrainDocs – and so far they love it. We expect to learn a lot as more people use the tool for different purposes. We are looking forward to developing ways for intelligent agents to interact – just like people – by comparing what they find within documents. We are finding that it is best for each agent to specialize in a specific subject. So finding ways for agents to compare their results using Boolean operators enables them to find similarities and differences between documents.
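One simple way to picture agents comparing results with Boolean operators is as set operations over the paragraphs each agent flags. The Python sketch below assumes each agent returns a set of paragraph numbers. The agent names and the hit sets are made up for illustration and are not output from ai-BrainDocs.

```python
def both(hits_a, hits_b):
    """AND: paragraphs flagged by both agents."""
    return hits_a & hits_b

def either(hits_a, hits_b):
    """OR: paragraphs flagged by at least one agent."""
    return hits_a | hits_b

def only_first(hits_a, hits_b):
    """AND NOT: paragraphs flagged by the first agent but not the second."""
    return hits_a - hits_b

if __name__ == "__main__":
    # Hypothetical paragraph numbers flagged by two specialized agents.
    consent_agent_hits = {2, 5, 9}
    indemnity_agent_hits = {5, 7, 9, 12}

    print(sorted(both(consent_agent_hits, indemnity_agent_hits)))
    print(sorted(either(consent_agent_hits, indemnity_agent_hits)))
    print(sorted(only_first(consent_agent_hits, indemnity_agent_hits)))
```

Keeping each agent narrow and combining their outputs afterward means a new question (“consent clauses that do not mention indemnity”) needs no retraining, only a different expression.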
One thing is clear: Intelligent agents are ideal for mining unstructured text to find small ideas hidden in big data.
We look forward to reporting more on our work with ai-BrainDocs soon.
Posted by: Olin Hyde