Archive for the ‘data mining’ Category

Context, Graphs and the Future of Computing

Friday, June 20th, 2014

Robert Scoble and Shel Israel’s latest book, Age of Context, is a survey of contributions from across the globe to the forces influencing technology and our lives today.  The five forces are mobile, social media, data, sensors and location.  Scoble calls these the five forces of context; harnessed together, they are the future of computing.

Pete Mortensen also addressed context in his brilliant May 2013 article in Fast Company, “The Future of Technology Isn’t Mobile, It’s Contextual.”   So why is context so important (and difficult)?  First, context is fundamental to our ability to understand the text we’re reading and the world we live in.  In semantics, there is the meaning of the words in a sentence and the context supplied by the page, chapter, book and prior works or conversations, but also the context that the reader’s education and experience add to the understanding.  As a computing problem, this is the domain of text analytics.

Second, if you broaden the discussion as Mortensen does to personal intelligent agents (Siri, Google Now), the bigger challenge is complexity.  The inability to understand context has always made it difficult for computers and people to work together.  People, and the language we use to describe our world, are complex, not mathematical; we can’t be reduced to a formula or rule set, no matter how much data is crunched. Mortensen argues (and we agree) that the five forces are finally giving computers the foundational information needed to understand “your context,” and that this context is expressed in four data graphs. These data graphs are:

  • Social (friends, family and colleagues),
  • Interest (likes & purchases),
  • Behavior (what you do & where) and
  • Personal (beliefs & values).

While Google Glass might be the poster child of a contextual UX, ai-one has the technology to power these experiences by extracting Mortensen’s graphs from the volumes of complex data generated by each of us through our use of digital devices and interaction with increasing numbers of sensors known as the Internet of Things (IoT).  The Nathan API is already being used to process and store unstructured text and deliver a representation of that knowledge in the form of a graph.  This approach is being used today in our BrainDocs product for eDiscovery and compliance.

In Age of Context, ai-one is pleased to be recognized as a new technology addressing the demands of these new types of data.  The data and the applications that use them are no longer stored in silos where only domain experts can access them.  With Nathan, the data space learns from the content, delivering a more relevant contextual response to applications in real time with user interfaces that are multi-sensory, human and intuitive.

We provide developers this new capability in a RESTful API. In addition to extracting graphs from user data, developers can build biologically inspired intelligent agents that they can train and embed in intelligent architectures.  Our new Nathan is enriched with NLP in a new Python middleware that allows us to reach more OEM developers.  Running in the cloud and integrated with big data sources and ecosystems of existing APIs and applications, developers can quickly create and test new applications or add intelligence to old ones.
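
For the curious, here is a rough idea of what calling such a REST API might look like from Python. This is only an illustrative sketch: the endpoint URL, payload and response shapes below are our assumptions for the example, not the published Nathan interface.

```python
import requests

# Hypothetical endpoint -- the real Nathan API routes and field names
# may differ; consult the ai-one developer documentation.
NATHAN_URL = "https://api.example.com/nathan/v1/graph"

def extract_graph(text: str, api_key: str) -> dict:
    """Send raw text to a (hypothetical) Nathan REST endpoint and
    return the extracted knowledge graph as parsed JSON."""
    response = requests.post(
        NATHAN_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"nodes": [...], "edges": [...]}
```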

For end users, the Analyst Toolbox (BrainBrowser and BrainDocs) demonstrates the value proposition of our new form of artificial intelligence and shows developers how Nathan can be used with other technologies to solve language problems.  While we will continue to roll out new features to this SaaS offering for researchers, marketers, government and compliance professionals, the APIs driving the applications will be available to developers.

Mortensen closes, “Within a decade, contextual computing will be the dominant paradigm in technology.”  But how?  That’s where ai-one delivers.  In coming posts we will discuss some of the intelligent architectures built with the Nathan API.

ai-one Contributes to ETH Publication on Knowledge Representation

Tuesday, June 3rd, 2014

We are pleased to announce the availability of the following publication from the prestigious ETH Zurich.  This book will be a valuable resource for developers, data scientists, and search and knowledge management educators and practitioners trying to deal with the massive amounts of information in both public and private data sources.  We are proud to have our contribution to the field acknowledged in this way.

Knowledge Organization and Representation with Digital Technologies

http://www.degruyter.com/view/product/205460  |  ISBN: 978-3-11-031281-2

ai-one was invited to co-author a chapter in this technical book.

In this anthology, readers will find a synopsis of very different conceptual and technological methods for modeling and digitally representing knowledge in knowledge organizations (universities, research institutes and educational institutions) and companies, presented through practical examples. Both basic models of knowledge organization and technical implementations are discussed, including their limitations and difficulties in practice.  In particular, the areas of knowledge representation and the semantic web are explored. Best-practice examples and successful application scenarios provide the reader with a knowledge repository and a guide for implementing their own projects. The following topics are covered in the articles:

  • hypertext-based knowledge management
  • digital optimization of the proven analog technology of the list box
  • innovative knowledge organization using social media
  • search process visualization for digital libraries
  • semantic events and visualization of knowledge
  • ontological mind maps and knowledge maps
  • intelligent semantic knowledge processing systems
  • fundamentals of computer-based knowledge organization and integration

The book also covers the coding of medical diagnoses, contributions on records management for the modeling and handling of business processes, the concept of mega-regions to support search processes, and the management of print publications in libraries.

Available in German only at this time.

Wissensorganisation und -repräsentation mit digitalen Technologien

http://www.degruyter.com/view/product/205460  |  ISBN: 978-3-11-031281-2


Big Data Solutions: Intelligent Agents Find Meaning of Text

Friday, January 18th, 2013


What if your computer could find ideas in documents? Building on the idea of fingerprinting documents, ai-one helped develop ai-BrainDocs – a tool to mine large sets of documents to find ideas using intelligent agents. This solves a big problem for knowledge workers: how to find ideas in documents that are missed by traditional keyword search tools (such as Google, Lucene, Solr, FAST, etc.).

Customers Struggle with Unstructured Text

Almost every organization struggles to find value in “big data” – especially in ideas buried within unstructured text. Often a very limited vocabulary can be used to express very different ideas. Lawyers are particularly talented at this: they can use 100 unique words to express thousands of ideas by simply changing the ordering and frequencies of the words.

Lawyers are not the only ones who need to find ideas inside documents. Other use cases include finding and classifying complaints, identifying concepts within social media feeds such as Twitter or Facebook, and mining PubMed to find related research articles. Recently, several healthcare companies have contacted us about mining electronic health records (EHR) data to find information buried within doctors’ notes so they can predict adverse reactions, find co-morbidity risks and detect fraud.

The common denominator for all these use cases is simple: how to find “what matters most” in documents. Organizations need a way to find these ideas fast enough to keep pace with the growth in documents. Given that information is growing at almost 20% per year, a very big problem now will be enormous next year.

Problems with Current Approaches

We’ve heard numerous stories from customers who were frustrated at the cost, complexity and expertise required to implement solutions that enable machines to read and understand the meaning of free-form text. Often these solutions use latent semantic indexing (LSI) and latent Dirichlet allocation (LDA). In one case, a customer spent more than two years trying to combine LSI with a Microsoft FAST enterprise search appliance running on SharePoint. It failed because they were searching a high volume of legal documents with very low variability: they were searching legal contracts to find paragraphs that included a very specific legal concept that could be expressed with many different combinations of words. Keyword search failed because the legal concept was expressed with commonly used words. LSI and LDA failed because the systems required a very large training set – often involving hundreds of documents. Even after reducing the specificity requirements, LSI and LDA still failed because they could not find the legal ideas at the paragraph level.

Inspiration

We found inspiration in the complaints we heard from customers: What if we could build an “intelligent agent” that could read documents like a person? We thought of the agent as an entry-level staff person who could be taught with a few examples then highlight paragraphs that were similar to (but not exactly like) the teaching examples.

Solution: Building Intelligent Agents

For several months, we have been developing prototypes of intelligent agents that mine unstructured text to find meaning. We built a Java application that combines ai-one’s machine learning API with natural language processing (OpenNLP) and a NoSQL database (MongoDB). Our approach generates an “ai-Fingerprint,” a representational model of a document built from keywords and association words. The “ai-Fingerprint” is similar to a graph G = (V, E), where G is the knowledge representation, the vertices V are keywords, and the edges E are associations. This can also be thought of as a topic model.
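
To make that concrete, here is a minimal sketch of what one such fingerprint might look like when stored in MongoDB. The field names and values are illustrative assumptions only; the real ai-Fingerprint schema is internal to the application.

```python
from pymongo import MongoClient

# Hypothetical document shape: keywords are the vertices V, and each
# keyword's association list supplies the edges E.
fingerprint = {
    "doc_id": "contract-0042",
    "keywords": ["consent", "assignment", "notice"],
    "associations": {
        "consent": ["written", "prior", "withheld"],
        "assignment": ["agreement", "transfer"],
        "notice": ["days", "written"],
    },
}

client = MongoClient("mongodb://localhost:27017")
client["braindocs"]["fingerprints"].insert_one(fingerprint)
```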

The ai-Fingerprint can be generated for almost any size text – from sentences to entire libraries of documents. As you might expect, the “intelligence” (or richness) of the ai-Fingerprint is proportional to the size of text it represents. Very sparse text (such as a tweet) has very little meaning. Large texts, such as legal documents, are very rich. This approach to topic modelling is precise — even without training or using external ontologies.

[NOTE: We are experimenting with using ontologies (such as OWL and RDF) as a way to enrich ai-Fingerprints with more intelligence. We are eager to find customers who want to build prototypes using this approach.]

The Secret Sauce

The magic is that ai-one’s API automatically detects keywords and associations – so it learns faster with fewer documents and provides a more precise solution than mainstream machine learning methods that use latent semantic analysis. Moreover, ai-one’s approach makes it relatively easy for almost any developer to build intelligent agents.

How to Build Intelligent Agents?

To build an intelligent agent, we first had to consider how a human reads and understands a document.

The Human Perspective

Humans are very good at detecting ideas – regardless of the words used to express them. As mentioned above, lawyers can express dozens of completely different legal concepts with a vocabulary of just a few hundred words. Humans can recognize the subtle differences between two paragraphs by how a lawyer uses words – both in meaning (semantics) and structure (syntax). Part of the cleverness of a lawyer is finding ways to combine as few words as possible to express a very precise idea that accomplishes a specific legal or business objective. In legal documents, each new idea is almost always expressed in its own paragraph. So two paragraphs might have the exact same words but express completely different ideas.

To find these ideas, a person (or computer) must detect patterns of word use – similar to finding a pattern in a signal. For example, as a child I knew I was in trouble when my mother called me by my first and last name – the combination of these words created a “signal” that was different than when she just used my first name. Similarly, a legal concept has a different meaning if two words occur together, such as “written consent,” than if it only uses the word “consent.”
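
A toy illustration of that “signal”: the sketch below simply checks whether the words of a phrase occur adjacently, which is a crude stand-in for the much richer word-use patterns the system actually learns.

```python
def has_phrase_signal(text: str, phrase: tuple) -> bool:
    """True when the words of `phrase` occur adjacently -- a stronger
    'signal' than any one of those words appearing on its own."""
    words = text.lower().split()
    n = len(phrase)
    return any(tuple(words[i:i + n]) == tuple(phrase)
               for i in range(len(words) - n + 1))

print(has_phrase_signal("prior written consent is required",
                        ("written", "consent")))   # True
print(has_phrase_signal("consent may be withheld",
                        ("written", "consent")))   # False
```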

The (Conventional) Machine Learning Perspective

It’s almost impossible to explicitly program a computer to find such “faint signals” within a large number of documents. To do so, the computer would have to be programmed with every possible combination of words that could express a given idea, then search for and match each one.

Machine learning technologies enable computers to identify features within the data to detect patterns. The computer “learns” by recognizing the combinations of features as patterns.

[There are many forms of machine learning – so I will keep focused only on those related to our text analytics problem.]

Natural Language Processing

One of the most important forms of machine learning for text analytics is natural language processing (NLP). NLP tools are very good at codifying the rules of language for computers to detect linguistic features – such as parts of speech, named entities, etc.

However (at the time of this writing), most NLP systems can’t detect patterns unless they are explicitly programmed or trained to do so. Linguistic patterns are very domain specific: the language used in medicine is different from the language used in law, and so on. Thus, NLP is not easily generalized. NLP only works in specific situations where there is predictable syntax, semantics and context. IBM Watson can play Jeopardy! but has had tremendous problems finding commercial applications in marketing or medical records processing. Very few organizations have the budget or expertise to train NLP systems. They are left to either buy an off-the-shelf solution (such as StoredIQ) or hire a team of PhDs to modify one of the open-source NLP tools. Good luck.

Latent Analysis Techniques

Tools such as latent semantic analysis (LSA), latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) are all capable of detecting patterns within language. However, they require tremendous expertise to implement and often require large numbers of training documents. LSA and LSI are computationally expensive because they must recalculate the relationships between features each time they are given something new to learn. Thus, learning the meaning of the 1,001st document requires a calculation across the 1,000 previously learned documents. LSA uses a statistical approach called singular value decomposition (SVD) to isolate keywords. Unlike LSA, ai-one’s technology also detects the association words that give a keyword context.
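
For readers who want to see why that retraining is costly, here is a minimal conventional-LSA pipeline using scikit-learn: the corpus is vectorized and then projected through a truncated SVD, and when a new document arrives the SVD generally has to be refit over the whole corpus. (This sketch illustrates ordinary LSA, not ai-one’s technology.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "assignment requires prior written consent",
    "consent may not be unreasonably withheld",
    "either party may terminate upon written notice",
]

# Term-document matrix, then a low-rank SVD projection ("topics").
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = lsa.fit_transform(tfidf)  # one dense vector per document

# The costly part: when document 1,001 arrives, the vectorizer and SVD
# must be refit across the entire corpus, not just the new arrival.
```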

Similar to our ai-Fingerprint approach, LDA uses a graphical model for topic discovery. However, it takes tremendous skill to develop applications using LDA. Even when implemented, it requires the user to make informed guesses about the nature of the text. Unlike LDA, ai-one’s technology can be learned in a few hours. It requires no supervision or human interaction. It simply detects the inherent semantic value of text – regardless of language.

Our First Intelligent Agent Prototype: ai-BrainDocs

It took our team about a month to build the initial version of ai-BrainDocs. Our team used ai-one’s keyword and association commands to generate a graph for each document. This graph goes into MongoDB as a JSON object that represents the knowledge (content) of each document.

Next we created an easy way to build intelligent agents: we simply provide the API with examples of the concepts we want to find. This training set can be very short. For one type of legal contract, it took only 4 examples of text for the intelligent agent to achieve 90% accuracy in finding similar concepts.

Unlike solutions that use LSI, LDA and other technologies, the intelligent agents in ai-BrainDocs find ideas at the paragraph level. This is a huge advantage when looking at large documents – such as medical research or SEC filings.

Next we built an interface that allows the end-user to control the intelligent agents by setting thresholds for sensitivity and determining how many paragraphs to scan at a time.
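
To give a feel for how such an agent behaves, here is a toy sketch. The real ai-BrainDocs agents compare ai-Fingerprints; in this example, plain keyword-set overlap (Jaccard similarity) stands in for the actual similarity measure, which we have not published.

```python
def jaccard(a: set, b: set) -> float:
    """Keyword-set overlap -- a stand-in for the real fingerprint
    similarity measure used inside ai-BrainDocs."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

class ToyAgent:
    """Train on a few example paragraphs, then flag paragraphs whose
    similarity clears a user-set sensitivity threshold."""

    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold
        self.examples = []

    def train(self, paragraphs):
        self.examples += [set(p.lower().split()) for p in paragraphs]

    def matches(self, paragraph: str) -> bool:
        features = set(paragraph.lower().split())
        return any(jaccard(features, ex) >= self.threshold
                   for ex in self.examples)
```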

Our first customers are now testing ai-BrainDocs – and so far they love it. We expect to learn a lot as more people use the tool for different purposes. We are looking forward to developing ways for intelligent agents to interact – just like people – by comparing what they find within documents. We are finding that it is best for each agent to specialize in a specific subject, so finding ways for agents to compare their results using Boolean operators enables them to find similarities and differences between documents.

One thing is clear: Intelligent agents are ideal for mining unstructured text to find small ideas hidden in big data.

We look forward to reporting more on our work with ai-BrainDocs soon.

Posted by: Olin Hyde

Building Machine Learning Tools to Mine Unstructured Text

Friday, February 17th, 2012

This presentation describes how to build tools that find the meaning of unstructured text using machine-generated knowledge representation graphs built with NLP and ai-one’s Topic-Mapper API.
The prototype solution, called ai-Browser, is a generalized approach that can solve the following types of use cases:
  • Sentiment analysis of social media feeds
  • Evaluating electronic medical records for clinical decision support systems
  • Comparing news feeds
  • Electronic discovery for legal purposes
  • Automatically tagging documents
  • Building intelligent search agents
The source code for ai-Browser is available to developers to customize to meet specific requirements. For example:
  • Healthcare providers can use ai-Browser to analyze medical records by using ontologies and medical lexicons.
  • Social media marketing agencies can use ai-Browser to create personal profiles of customers by reading social media feeds.
  • Researchers can use ai-Browser to mine PubMed and other repositories.
Our goal is to get the source code and the API into the hands of commercial companies who want to tailor the application to solve specific problems.
Click here to download the presentation from SlideShare.

Partnership to Create New Social Media Intelligence Tools

Thursday, February 16th, 2012

New Partnership Targets Creation of Social Media Intelligence Tools

Press Release


New tools will enable machine learning of twitter feeds

La Jolla, CA | Zurich | Berlin – February 16, 2012 – ai-one inc. and Gnostech Inc. announced a partnership today to build new machine learning applications for the US government and military. The deal brings together two small firms that are well known for developing cutting-edge technologies. Gnostech specializes in simulation and modeling; Command, Control, Communications, Computers, Intelligence, Surveillance and Reconnaissance (C4ISR) systems; and security engineering and Information Assurance (IA) applications. The partnership with ai-one provides Gnostech with access to technology that enables computers to learn the meaning and context of data in a way that is similar to humans. Called “biologically inspired intelligence,” the technology is a new form of machine learning that is particularly useful for understanding complex, unstructured information – such as conversations in social media.

In the past month, the US government has issued six requests for companies to create solutions to help better understand Twitter, Facebook and other social media sources. These broad area announcements (BAAs) are formal requests from the Government inviting companies to provide turn-key solutions. With more than 800 million people actively using Facebook and more than 100 million Twitter users, governments and intelligence agencies know that they need better ways to mine this data to get real-time information to protect national security.

“We now have more than 40 partners worldwide that are experimenting with our technology – but only 3 that specialize in US government applications,” said Tom Marsh, President of ai-one. “Gnostech is local, technically driven and well positioned to develop rapid prototypes using our technology.”

About Gnostech: Since 1981, Gnostech has provided technical and engineering services to the Department of Defense (DOD) and the Department of Homeland Security (DHS). Gnostech has a proven reputation for engineering efficiency, systems innovation, and dedicated customer service.

Gnostech Inc. began as an engineering and consulting company in Warminster, PA with expertise in GPS simulations and software, initially supporting the US Navy at the Naval Air Development Center (NADC) in Warminster, PA. Today, Gnostech has grown from a few people to about 50 employees with a satellite office in San Diego, CA and engineering support staff in Norfolk, VA, Morristown, NJ and Philadelphia, PA. Gnostech’s technical expertise expands upon our GPS experience and extends into Mission Planning, Network Engineering, Information Assurance and Security Engineering.  www.gnostech.com

About ai-one inc.: ai-one provides an “API for building learning machines”.  Based in San Diego, Zurich and Berlin, ai-one’s software technology is an adaptive holosemantic data space with semiotic capabilities (“biologically inspired intelligence”).  The Topic-Mapper™ SDK for text enables developers to create intelligent applications that deliver better sense-making capabilities for semantic discovery, lightweight ontologies, knowledge collaboration, sentiment analysis, artificial intelligence and data mining.  www.ai-one.com

Mining Unstructured Text: A new machine learning approach

Monday, February 13th, 2012

We believe we have found a way to apply a new general-purpose machine learning technology to solve domain-specific problems by mining unstructured text. The solution addresses fundamental problems in knowledge management:

How to find information that is difficult to describe?

For example, suppose you want to match a person to an open job position. What attributes do you use to represent a complex subject (like a person) to find the best fit?

What if the single best answer is hidden within a vast amount of unstructured text?

Let’s say you want to repurpose a drug – such as exploiting the side effect of a chemical to treat a disease via a newly discovered metabolic pathway. How would you search through the 21+ million research articles in PubMed to find the best match among the 2,000+ known drug compounds?

What if the textual information is constantly changing?

What if you want to provide personalized marketing to a person based on what they are saying on Facebook, Twitter or LinkedIn?  To do this, you must understand the meaning of what they are saying. The most accurate approach is to have people read and interpret the conversations because we are fantastic at understanding the complexity of language. But to do this with a computer requires a different approach: Machines must learn like humans. They must understand how meaning evolves in a conversation, how to disambiguate, how to detect the single most important concepts, etc.

Big Data Means Big Opportunity

These are classic “Big Data” problems – and they are rampant. Finding a solution would change everything; from how we discover new drugs to what social media would tell us about ourselves.

There have been many attempts to find ways for machines to learn like a human. Artificial intelligence has made bold promises that have been consistently broken for more than 50 years. Yet, we still don’t have a universal approach for machines to learn and understand language like a human.

Growth of Websites

Now, more than ever, we need a new approach to mining unstructured text. As of February 2012, it is estimated that the Internet has more than 614 million websites. More than 1.8 zettabytes of information was created in 2011 – much of it unstructured text from our comments on websites, news articles, social media feeds… just about anything where people communicate with language rather than numbers.

Unstructured text can’t be processed like structured data. Rather, it requires an approach that enables knowledge representation in a form that machines can process.

Knowledge representation is a rich field and there has been tremendous effort and innovation – too much to describe here. However, we still live in a world where the overwhelming majority of people (including almost every CIO, developer and consumer) CANNOT find the information they seek with a simple query. Rather, the domain of text analytics and data mining is dominated by specialists who use tools that are very difficult to learn and very expensive to deploy (because they require highly skilled programmers).

We set out to create a new toolset that would be easy to use for almost any programmer to build data mining tools for unstructured text.

ai-browser: A prototype for human-machine collaboration

For the past several months, we have been working on a new approach for text analytics and data mining. The idea is to create a tool that enables human-machine collaboration to quickly mine unstructured data to find the single best answer.

We now have a working prototype, called ai-browser, that solves knowledge management and data mining problems involving unstructured text. It combines natural language processing (NLP) and pattern recognition technologies to generate a precise knowledge representation graph.  Our team selected OpenNLP because it is open-source and easy to use and customize. We used the Topic-Mapper API to detect patterns within the text after it was pre-processed to isolate parts of speech; a simplified sketch of this pipeline follows the list below. The system also allows users to use ontologies and/or reference documents to sharpen the results. The output is a graph that can be used in a number of ways with 3rd-party products, such as:

  • Submission to search appliances like Google, Bing, Lucene, etc.
  • Analysis with modelling tools like Cytoscape, MATLAB, SAS, etc.
  • Enterprise systems for reporting, knowledge management and/or decision support
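
To show the shape of that pipeline, here is a simplified stand-in. The real prototype uses OpenNLP for part-of-speech filtering and the Topic-Mapper API for pattern detection; the frequency-based keyword selection and fixed co-occurrence windows below are illustrative substitutes only.

```python
from collections import Counter
from itertools import combinations

def build_graph(text: str, top_n: int = 5, window: int = 8) -> dict:
    """Simplified stand-in for the ai-browser pipeline: pick frequent
    words as keywords, then link keywords that co-occur in a window."""
    words = [w for w in text.lower().split() if w.isalpha()]
    keywords = {w for w, _ in Counter(words).most_common(top_n)}
    edges = Counter()
    for i in range(0, len(words), window):      # crude context windows
        present = keywords & set(words[i:i + window])
        edges.update(combinations(sorted(present), 2))
    return {"nodes": sorted(keywords), "edges": dict(edges)}
```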

This graph makes it easy to ask questions like, “Find me something like _______!” and get a very tightly clustered group of results – rather than millions of hits.

Even more impressive, ai-browser’s graph is a powerful tool that can be applied to a wide range of applications, such as:

  • Healthcare – clinical decision support systems to enable physicians to make better decisions by understanding all the relevant information held in electronic medical records (EMRs) – including emerging trends and relationships within the patient population.
  • Social media – detecting and tracking sentiments in conversations over time (such as Twitter) to understand how brands are perceived by customers.
  • Innovation management – discovering the relationships of information across disciplines to foster more productive collaboration and interdisciplinary discoveries.
  • Information comparison and confirmation – determine the similarities and differences between two different sources of content.
  • Human resources – sourcing and placement of the best candidate for a job based on previous work experience.

The intent of the ai-browser design is to provide a starting point for developers to build solutions to meet the specific needs of enterprise customers. For example, modifying the system enables solutions to the following use cases:

  • Help a physician determine if additional tests are necessary to confirm a diagnosis.
  • Determine how perceptions about a brand are changing through conversations on Twitter.
  • Find new uses for a drug by reviewing clinical studies published on PubMed and determining if there are relevant patent filings.
  • Identify stock market trading opportunities by comparing news feeds and SEC filings on a particular company or industry.
  • Find the best person for a job by searching the internet for someone who is “just like the person who had this job last year.”

Enterprise Data Mining: A far easier, lower cost approach.

Unlike other data mining approaches, ai-browser learns the meaning of documents by generating a lightweight ontology – a dynamic file that describes every relationship between every data element. It detects keywords and their association words, which provide context. The combination of a keyword and all its association words can be thought of as a coordinate (x, y0…yT), where x is the keyword and y0…yT is the series of association words for that specific keyword. The collection of these coordinates creates a topology for the document: a graph G = (V, E), where V is the set of vertices (or nodes) represented by the keywords and E is the set of edges represented by the associations to each keyword.
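
Expressed in code, the construction looks roughly like this. We use the networkx library purely for illustration; the actual implementation differs, and the fingerprint data below is made up.

```python
import networkx as nx

# One keyword x with its association series y0..yT becomes a star of
# directed edges; the union over all keywords is the document topology.
fingerprint = {
    "drug": ["compound", "pathway", "dose"],
    "pathway": ["metabolic", "target"],
}

G = nx.DiGraph()
for keyword, associations in fingerprint.items():
    for assoc in associations:
        G.add_edge(keyword, assoc)     # edge E: keyword -> association

print(G.number_of_nodes(), G.number_of_edges())  # 6 5
```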

[Figure: ai-fingerprint of a Fox News article]

We call this graph the “ai-fingerprint.” It is a lossless knowledge representation model. It captures the meaning of the document by showing the context of words and the clustering of concepts. It is lossless because it captures every relationship in a directed graph – thereby revealing the significance of a word that may only appear once yet is central to the meaning of a large, complex textual data set.

ai-browser expresses ai-fingerprints using the XGMML format over REST. This enables it to accommodate dynamic data, so the graph can change as the underlying text changes (such as text from social media feeds).
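
Here is a minimal sketch of what serializing such a graph to XGMML might look like. The element set is pared down to nodes and edges; real XGMML documents (for example, those consumed by Cytoscape) typically carry more attributes, and this is not ai-browser’s actual serializer.

```python
from xml.sax.saxutils import quoteattr

def to_xgmml(graph: dict, label: str = "ai-fingerprint") -> str:
    """Serialize a {keyword: [association, ...]} dict as minimal XGMML,
    the XML graph format read by tools such as Cytoscape."""
    out = [f'<graph label={quoteattr(label)} directed="1" '
           'xmlns="http://www.cs.rpi.edu/XGMML">']
    nodes = set(graph) | {a for assocs in graph.values() for a in assocs}
    for n in sorted(nodes):
        out.append(f'  <node id={quoteattr(n)} label={quoteattr(n)}/>')
    for src, targets in graph.items():
        for tgt in targets:
            out.append(f'  <edge source={quoteattr(src)} '
                       f'target={quoteattr(tgt)}/>')
    out.append('</graph>')
    return "\n".join(out)
```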

Contact Olin Hyde to schedule a demo of ai-Browser. The source code is available to programmers to license and modify to solve specific problems.