Joining artificial intelligence, linguistics and statistics, Natural Language Processing (NLP) is a discipline 50 years in the making.
Artificial intelligence (AI) is everywhere nowadays: smart cities, robots, virtual assistants, genome editing and many other cutting-edge technologies are powered by some degree of simulation of human intelligence via a computer. At ELCA, we mainly apply artificial intelligence to the analysis of human (or natural) language. Indeed, making sense of content is an inevitable requirement for any organization dealing with text documents, images, audio and video content. As most of such information is either created as or converted to text, one key concept in AI is natural language processing (NLP), i.e. the automated processing of human language via a computer.
Recent industrial breakthroughs in automatic speech recognition and understanding (Apple’s Siri, Amazon’s Alexa), question answering (IBM’s Watson) and sentiment analysis have proven the maturity of NLP technologies, with state-of-the-art solutions based on statistics and machine learning.
AI “hype” is providing a boost to these technologies, as deep learning algorithms become more widely adopted to solve many kinds of tasks. However, NLP is used even in contexts where there is a limited number of documents to be processed, e.g. in small and medium enterprise settings, where tried-and-true machine learning algorithms such as logistic regression, Naïve Bayes, random forests or SVMs are very effective.
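To illustrate how one of these tried-and-true approaches works, here is a minimal multinomial Naïve Bayes text classifier written from scratch. This is a sketch for explanation only: the class names and the tiny training data are invented for the example, and a real project would use a mature library such as scikit-learn.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)  # class frequencies
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for token in tokenize(doc):
                self.word_counts[label][token] += 1
                self.totals[label] += 1
                self.vocab.add(token)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best_class, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.classes:
            # log P(class) + sum of smoothed log P(token | class)
            score = math.log(self.priors[c] / self.n_docs)
            for token in tokenize(doc):
                count = self.word_counts[c][token]
                score += math.log((count + self.alpha) / (self.totals[c] + self.alpha * v))
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Even this bare-bones model captures the core idea: word statistics per class, combined with smoothing so that unseen words do not zero out a prediction.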
NLP is a vital player in decision support, scrutinizing textual data in search of interesting entities or events and identifying patterns and situations that deserve further human investigation. With sound probabilistic approaches and numerical evaluation metrics, it unlocks the latent semantics of unstructured information sources, be it free-text documents, social media messages, phone conversations, or more. In other words, NLP provides the key to converting data from all kinds of sources into knowledge, i.e. actionable intelligence. And that is just part of the story: thanks to the recent applications of deep learning to speech processing, it is now possible to transcribe the content of audio and video files in real-time.
Likewise, scanned documents can be processed with OCR to become machine readable, and image processing techniques can be deployed to recognize faces in CCTV footage or social network photos. This in turn can help us complete the picture concerning a person’s social network, as well as contribute to identifying its “missing links” (relationships with other individuals or entities).
ELCA’s AI & NLP Solutions for Your Organization
With a solid knowledge of NLP, machine learning and related solutions, ELCA helps organizations deal with all kinds of unstructured information.
See a quick overview below, or read the following sections to see how ELCA guides decision makers in selecting and integrating the most appropriate solution into specific scenarios.
Data connectors
Idea: Import data from unstructured data sources thanks to Web crawlers and connectors tailored to social networks, file systems, email, CMS, etc.
Where it’s used: All of the subsequent applications are fed by the output of such connectors; in many cases, “metadata mapping” takes place within the connector in order to select relevant information; for example, an HTML meta such as “keywords” can be selected to become the “tags” field in the search engine’s internal data model.
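The "metadata mapping" step can be sketched as a simple renaming filter. The source metadata names and target field names below are hypothetical; real connectors make this mapping configurable.

```python
# Hypothetical mapping from source metadata names to the search engine's
# internal fields; real connectors make this configurable.
META_TO_FIELD = {"keywords": "tags", "dc.creator": "author", "last-modified": "date"}


def map_metadata(source_meta):
    """Keep only known metadata, renamed to the internal data model's fields."""
    return {META_TO_FIELD[key]: value
            for key, value in source_meta.items() if key in META_TO_FIELD}
```

Anything the internal data model does not know about (here, for instance, a "viewport" meta) is simply dropped before indexing.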
Enterprise search
Idea: Efficient and effective search is the first step to understanding the content of unstructured documents (files, web pages, social network messages, multimedia). Indeed, NLP is a powerful ally of search and monitoring applications.
Where it’s used: Enterprise search engines support great volumes of data and intelligent text mining workflows within the privacy constraints of their own firewalls. Organizations that need to access information while adhering to the privacy policies of their intranets can thus benefit from fully in-house search solutions.
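At the heart of every search engine sits an inverted index: a map from each token to the documents that contain it. The toy version below (invented for illustration; production engines add ranking, analysis chains and much more) shows the principle.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

Multi-term queries intersect the per-term posting sets, which is why lookups stay fast even over large collections.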
Question answering
Idea: Given any question (“What is severance pay?”, “What’s the surface area of Switzerland?”), obtain concise and relevant answers without the need to browse through all supporting documents.
Where it’s used: Wolfram Alpha and IBM’s Watson are examples of production systems that can detect relevant information to automatically answer even complex questions based on large amounts of supporting documents.
Social network analysis
Idea: Monitor what social networks are talking about - and in what terms.
Where it’s used: With the increased popularity of blogs and social networks, information on the Web is more and more subjective: statistical NLP solutions make sure this valuable source of information is not overlooked, whatever the format or language. For example, they allow analysts to:
- Identify positive, negative or neutral sentiment on a particular topic/at a given location.
- Guess age range and gender of Twitter users based on what they write.
- Distinguish comments from statements or suggestions on social media.
- Determine the political orientation of blog contents.
- Identify clusters of people interested in the same topic.
- Identify relationships between users and track the evolution of relationships/interests over time for target users.
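The last two bullets can be sketched with a mention graph: link users who @-mention each other, then take connected components as a crude proxy for communities of interest. This is a deliberately minimal illustration; real analyses use richer graph algorithms and signals.

```python
import re
from collections import defaultdict


def mention_graph(tweets):
    """Link each author to the users they @-mention (undirected edges)."""
    graph = defaultdict(set)
    for author, text in tweets:
        for mention in re.findall(r"@(\w+)", text):
            graph[author].add(mention)
            graph[mention].add(author)
    return graph


def communities(graph):
    """Connected components of the mention graph, a crude proxy for communities."""
    seen, result = set(), []
    for node in list(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            user = stack.pop()
            if user in component:
                continue
            component.add(user)
            seen.add(user)
            stack.extend(graph[user] - component)
        result.append(component)
    return result
```

Tracking how these components merge and split over time is one way to follow the evolution of relationships and interests for target users.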
Idea: Automatically categorize documents according to any business taxonomy (e.g. news by category, travelers by type, Web site visitors by profile) based on representative samples.
Where it’s used: All kinds of business situations where paper content must be digitized or archived call for OCR technology: for example, analyzing fax content, customs bills, commercial transactions and discharge letters. Moreover, defense and control organizations such as local police need face recognition to quickly detect and distinguish faces in e.g. CCTV footage.
Machine translation
Idea: Search for information in multiple languages; translate results into the language of choice.
Where it’s used: Institutions and organizations whose information needs span across documents in different languages use automatic translation systems and/or services to quickly convert one text to multiple languages. Translated versions of important documents can then be searched by enterprise search engines to perform cross-language information retrieval.
GIS and geocoding
Idea: Geographic Information Systems (GIS) integrate, store and analyze geographical information. With such systems, it is possible to resolve the geographical coordinates of places mentioned in textual and multimedia documents.
Where it’s used: In any situation where a customer portfolio is involved (bank, insurance, retail), being able to geographically locate entities (products, customers, transactions) provides a commercial advantage. Military organizations rely heavily on GIS and geolocation for mission control. In a news-monitoring context (financial intelligence, news operators), analysts require geolocation to identify places mentioned in their documents.
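The core geocoding step — resolving place names found in text to coordinates — can be illustrated with a lookup against a gazetteer. The three-entry gazetteer below, with approximate coordinates, is invented for the example; real systems query a full geographical database or a geocoding service.

```python
# Toy gazetteer with approximate coordinates; a real system would use a full
# geographical database or a geocoding service.
GAZETTEER = {
    "geneva": (46.20, 6.14),
    "lausanne": (46.52, 6.63),
    "zurich": (47.37, 8.54),
}


def geocode_mentions(text):
    """Resolve place names mentioned in free text to coordinates."""
    lowered = text.lower()
    return {place: coords for place, coords in GAZETTEER.items() if place in lowered}
```

In practice this step follows named entity recognition, so that only spans already identified as locations are looked up, avoiding false matches on ambiguous words.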
Named entity recognition
Idea: Identify people, locations, organizations and other entities mentioned in text (dates, email addresses, license plates and telephone numbers) – no lists needed!
Where it’s used: Information extraction toolkits and services are widely used by organizations seeking to automatically identify relevant documents with respect to their “business” entities; for example, news agencies can automatically link current articles to similar previous articles based on the people mentioned in them.
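The pattern-based entity types in the idea above (dates, email addresses, phone numbers) can be illustrated with a few regular expressions; people, locations and organizations, by contrast, are found by trained statistical models rather than patterns or lists. The patterns below are deliberately simplified sketches, not production-grade expressions.

```python
import re

# Simplified, illustrative patterns; real systems use statistical models for
# names of people and organizations, which need no lists or fixed patterns.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "date": re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b"),
    "phone": re.compile(r"\+\d{2}[\s\d]{8,12}\d"),
}


def extract_entities(text):
    """Return all pattern-based entity mentions found in the text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```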
Relationship discovery and visualization
Idea: Detect and visualize relationships between entities mentioned both within documents and across documents; use Semantic Web resources to collect more information on known entities, e.g. coordinates of a city for geolocation/organizations linked to a person.
Where it’s used: news agencies, institutions such as NGOs or e-Government bodies, and private companies with large archives use Semantic Web tools (Linked Data) to automatically tag their content and enrich the information it contains; thanks to these tools, Wikipedia-derived information can be cross-linked with the information extracted from documents to acquire more details.
Topic and key-phrase extraction
Idea: Quickly tag and summarize documents by identifying their most important words, expressions or topics (i.e. bottom-up categories). Thanks to topic and key-phrase extraction, documents can be “tagged” automatically, making them candidates for further analysis or simply providing semantic filters to search engines.
Where it’s used: Many document management suites and open source solutions allow for the automatic tagging of documents. This is particularly useful to quickly grasp the main content of documents, often complementing named entity recognition, which focuses on proper names only.
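One classic way to pick a document's most important words is TF-IDF: score each term by how frequent it is in the document and how rare it is across the collection. The sketch below assumes pre-tokenized documents and is a minimal illustration of the idea.

```python
import math
from collections import Counter


def tfidf_keywords(docs, index, top_k=3):
    """Rank the tokens of docs[index] by TF-IDF against the whole collection."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each token
    tf = Counter(docs[index])   # term frequency in the target document
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that appear in every document get a near-zero score, so the top-ranked terms are those that actually characterize the document.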
NLP: Selected applications
Virtual assistants & Chatbots
Virtual assistants are artificially intelligent agents that help users find information or perform tasks through conversation. They can be embedded in a company’s website or chat with its customers via mobile apps – some even say that “chatbots” are the new apps. They can also be reached through smart speakers and many other channels, including major messaging platforms.
Chatbots can be effective customer care solutions: they are able to instantly address frequently asked questions and to perform easy, repetitive tasks – this improves customer satisfaction while reducing contact agent workload. However, chatbots are also very useful inside the workplace: they can serve as IT helpdesk assistants or aids when browsing enterprise documents or carrying out procedures.
Developing a chatbot is an exciting opportunity to combine existing natural language processing techniques (such as intent categorization, entity extraction, and dialog management) in a new application. However, it is also a challenging experience from business, compliance and legal viewpoints. ELCA has demonstrated know-how in helping many organizations offer chatbots and virtual assistants to their customers or collaborators. We use several technology providers as well as custom open-source components to deliver the best-fitting solution to our customers.
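To make intent categorization concrete, here is a deliberately naive keyword-overlap sketch; the intent names and trigger words are invented, and a production bot would instead train a statistical intent classifier on labeled utterances.

```python
# Hypothetical intents and trigger keywords; a production bot would train a
# statistical intent classifier on labeled utterances instead.
INTENTS = {
    "opening_hours": {"open", "hours", "close", "closing"},
    "reset_password": {"password", "reset", "login", "forgot"},
}


def classify_intent(utterance):
    """Return the intent whose keyword set best overlaps the utterance."""
    tokens = set(utterance.lower().replace("?", "").split())
    best_intent, best_overlap = None, 0
    for intent, keywords in INTENTS.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent
```

Once the intent is known, entity extraction fills in the details (dates, account numbers) and dialog management decides what to ask or answer next.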
Automatic classification
Many organizations require or would greatly benefit from the automatic labeling of documents according to their specific business needs. For instance, a bank may wish to assign its customers’ emails to the appropriate contact center agent; an insurance company may need to categorize statements by type. A law enforcement unit may wish to distinguish between relevant and non-relevant documents (e.g. analyst reports) for an ongoing operation.
ELCA has proven experience with tailor-made machine learning models for automatic document classification. Usually, we are able to reach highly satisfactory results even with a very limited amount of training data and without any business rules or lists. But how does this work? Our artificial learners use purely statistical features (such as word distributions) and make efficient use of human input by requesting feedback only in particularly difficult cases.
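Requesting feedback only on difficult cases is commonly done with uncertainty sampling: send a human the documents where the classifier's top two class probabilities are closest. The function below is a generic sketch of that selection step, assuming per-document probability lists from some upstream model.

```python
def select_for_review(class_probs, k=2):
    """Uncertainty sampling: pick the k documents whose top-two class
    probabilities are closest together (smallest margin)."""
    def margin(doc_id):
        top = sorted(class_probs[doc_id], reverse=True)
        return top[0] - top[1]

    return sorted(class_probs, key=margin)[:k]
```

Labeling only these borderline documents gives the model the most informative training examples per unit of human effort.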
But automatic classification need not be applied to documents only. We have experience categorizing tweets according to their sentiment and their author’s age range, gender and political views – in several languages. Our image processing algorithms, based on deep neural networks, can detect objects in pictures and assign them to a class (taxi, donut, tabby cat). ELCA’s automatic classifiers become enablers for third-party applications such as document management systems, search and distribution systems, and many more kinds of customer solutions.
Dashboards for content analysis
ELCA leverages its expertise in open source and proprietary search solutions alike to efficiently index text documents, audio, images and email, and to crawl websites and social networks. Functionalities such as faceted browsing (i.e. navigation via dynamic filters), metadata extraction and automatic language identification are now a standard of enterprise search, but we can go deeper than that. With ELCA’s dashboards for content analysis, information from several news websites and Twitter is aggregated in a search-based application where:
- The content of news, blogs and tweets can be searched by keyword.
- The names of people, places and organizations are automatically identified and can be used as dynamic filters; Wikipedia info boxes provide additional information about such entities.
- Tag clouds and network graphs help answer questions such as: Which places appear frequently together with Mr. X? Which companies and organizations are usually mentioned together? Who are the most frequently cited people when searching for a given topic?
- Live sentiment analysis is performed on Twitter streams given the search keyword(s).
The result is a dashboard that can be used either for open-source intelligence (OSINT) or for the integration of an organization’s internal information sources (reports, digests) with external information (news websites, social media, blogs, etc.).
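The entity co-occurrence counts behind such tag clouds and network graphs reduce to a simple computation: for every document, count each pair of entities it mentions. The sketch below assumes entity mentions have already been extracted per document.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(docs_entities):
    """Count how often each pair of entities appears in the same document."""
    pairs = Counter()
    for entities in docs_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs
```

The most frequent pairs become the edges of the network graph, directly answering questions like "which organizations are usually mentioned together?".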
To match the requirements of specific business scenarios, ELCA offers tailor-made solutions combining any of the following building blocks:
- Connectors to social media, news feeds, file systems, APIs.
- Linguistic analysis (POS tagging, chunking, syntactic parsing, n-gram extraction, embeddings).
- Information extraction, relationship discovery and visualization.
- Audio analysis (language detection, keyword spotting, speech-to-text transcription).
- Image analysis (OCR, face recognition, etc.).
- Search engines (Solr, Elasticsearch, Exalead, …).
- Clustering and Classification using (deep) learning techniques.