Guest Post: Information Retrieval using a Bayesian Model of Learning and Generalization

Daniel Tunkelang
Last updated: 2010/04/04 at 10:18 PM
Dinesh Vadhia, CEO and founder of “item search” company Xyggy, has been an active member of the Noisy Community for at least a year, and it is with pleasure that I publish this guest post by him, University of Cambridge / CMU Professor Zoubin Ghahramani, and University of Cambridge / Gatsby Computational Neuroscience Unit researcher Katherine Heller. I’ve annotated the post with Wikipedia links in the hope of making it more accessible to readers without a background in statistics or machine learning.

People are very good at learning new concepts after observing just a few examples. For instance, a child will confidently point out which animals are “dogs” after having seen only a couple of dogs. This ability to learn concepts from examples and to generalize to new items is one of the cornerstones of intelligence. By contrast, current internet search services exhibit little or no learning and generalization.

Bayesian Sets is a new framework for information retrieval based on how humans learn new concepts and generalize. In this framework a query consists of a set of items which are examples of some concept. Bayesian Sets automatically infers which other items belong to that concept and retrieves them. For example, given a query consisting of the two animated movies “Lilo & Stitch” and “Up”, Bayesian Sets would return other similar animated movies, like “Toy Story”.

How does this work? Human generalization has been intensely studied in cognitive science, and various models have been proposed based on some measure of similarity and feature relevance. Recently, Bayesian methods have emerged both as models of human cognition and as the basis of machine learning systems.

Bayesian Sets – a novel framework for information retrieval

Consider a universe of items, where the items could be web pages, documents, images, ads, social and professional profiles, publications, audio, articles, video, investments, patents, resumes, medical records, or any other class of items we may want to query.

An individual item is represented by a vector of features of that item.  For example, for text documents, the features could be counts of word occurrences, while for images the features could be the amounts of different color and texture elements.
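As a minimal sketch of the text-document case, the snippet below turns a document into a vector of word counts and then into a presence/absence vector. The vocabulary, the sample sentence and the helper name `count_features` are hypothetical and chosen only for illustration; a real system would use a much larger vocabulary and proper tokenization.

```python
from collections import Counter


def count_features(text, vocabulary):
    """Return one word-occurrence count per vocabulary entry."""
    # Naive whitespace tokenization; an assumption made for brevity.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and document.
vocabulary = ["earthquake", "sensor", "patent"]
document = "Earthquake sensor detects earthquake tremors"

count_vector = count_features(document, vocabulary)
# Presence/absence version of the same vector.
binary_vector = [1 if c > 0 else 0 for c in count_vector]

print(count_vector)
print(binary_vector)
```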

Given a query consisting of a small set of items (e.g. a few images of buildings) the task is to retrieve other items (e.g. other images) that belong to the concept exemplified by the query.  To achieve the task, we need a measure, or score, of how well an available item fits in with the query items.

A concept can be characterized by using a statistical model, which defines the generative process for the features of items belonging to the concept.  Parameters control specific statistical properties of the features of items.  For example, a Gaussian distribution has parameters which control the mean and variance of each feature. Generally these parameters are not known, but a prior distribution can represent our beliefs about plausible parameter values.

The score

The score used for ranking the relevance of each item x given the set of query items Q compares the probabilities of two hypotheses. The first hypothesis is that the item x came from the same concept as the query items Q. For this hypothesis, compute the probability that the feature vectors representing all the items in Q and the item x were generated from the same model with the same, though unknown, model parameters. The alternative hypothesis is that the item x does not belong to the same concept as the query examples Q. Under this alternative hypothesis, compute the probability that the features in item x were generated from different model parameters than those that generated the query examples Q. The ratio of the probabilities of these two hypotheses is the Bayesian score at the heart of Bayesian Sets, and can be computed efficiently for any item x to see how well it “fits into” the set Q.
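Written out, the ratio described above can be expressed as follows. The symbols θ (the unknown model parameters) and p(θ) (the prior over them) are notation introduced here for illustration; the items in Q are assumed to be generated independently given θ.

```latex
\[
\operatorname{score}(x)
  = \frac{p(x, Q)}{p(x)\, p(Q)}
  = \frac{p(x \mid Q)}{p(x)}
\]
\[
p(x, Q) = \int p(x \mid \theta)
          \Big[ \prod_{x_i \in Q} p(x_i \mid \theta) \Big]
          p(\theta)\, d\theta
\]
\[
p(Q) = \int \Big[ \prod_{x_i \in Q} p(x_i \mid \theta) \Big] p(\theta)\, d\theta,
\qquad
p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta
\]
```

The numerator is the first hypothesis (x and Q share the same parameters); the denominator is the alternative (x and Q are generated from independently drawn parameters). A score above 1 means the item is more probable under the shared-concept hypothesis.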

This approach to scoring items can be used with any probabilistic generative model for the data, making it applicable to any problem domain for which a probabilistic model of data can be defined. In many instances, items can be represented by a vector of features, where each feature can either be present or absent in the item. For example, in the case of documents the features may be words in some vocabulary, and a document can be represented by a binary vector x where element j of this vector represents the presence or absence of vocabulary word j in the document. For such binary data, a multivariate Bernoulli distribution can be used to model the feature vectors of items, where the jth parameter in the distribution represents the frequency of feature j. Using the beta distribution as the natural (conjugate) prior, the score can be computed extremely efficiently.
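For binary features with a beta prior, the integrals in the score have a closed form, and the log score becomes a constant plus a weighted sum over the features present in the item. The following is a minimal sketch of that computation, not the authors' production implementation. The movie data, the feature names and the uniform prior (alpha = beta = 1 for every feature) are assumptions made for illustration.

```python
import math


def bayesian_sets_log_scores(items, query_idx, alpha, beta):
    """Log score of every item given the query, for binary feature vectors.

    items:     list of 0/1 feature lists, one per item
    query_idx: indices into `items` that form the query set Q
    alpha, beta: per-feature beta prior hyperparameters (all > 0)
    """
    n = len(query_idx)
    d = len(alpha)
    # How many query items have each feature.
    s = [sum(items[i][j] for i in query_idx) for j in range(d)]
    # Posterior hyperparameters after observing the query.
    a_post = [alpha[j] + s[j] for j in range(d)]
    b_post = [beta[j] + n - s[j] for j in range(d)]
    # Constant term shared by all items.
    c = sum(
        math.log(alpha[j] + beta[j]) - math.log(alpha[j] + beta[j] + n)
        + math.log(b_post[j]) - math.log(beta[j])
        for j in range(d)
    )
    # Per-feature weight, added when the feature is present in an item.
    q = [
        math.log(a_post[j]) - math.log(alpha[j])
        - math.log(b_post[j]) + math.log(beta[j])
        for j in range(d)
    ]
    return [c + sum(q[j] * x[j] for j in range(d)) for x in items]


# Hypothetical toy data; features are (animated, family, action).
titles = ["Lilo & Stitch", "Up", "Toy Story", "The Terminator"]
items = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
alpha = [1.0, 1.0, 1.0]  # uniform prior, an assumption
beta = [1.0, 1.0, 1.0]

scores = bayesian_sets_log_scores(items, [0, 1], alpha, beta)
ranking = sorted(zip(titles, scores), key=lambda pair: -pair[1])
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

Because the score is linear in the item's features, ranking a whole collection amounts to one sparse matrix-vector product, which is why the exact computation stays fast on large corpora.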

Automatically learns

An important aspect of Bayesian Sets is that it automatically learns which features are relevant from queries consisting of two or more items. For example, a movie query consisting of “The Terminator” and “Titanic” suggests that the concept of interest is movies directed by James Cameron, and therefore Bayesian Sets is likely to return other movies by Cameron. We feel that the power of queries consisting of multiple example items is unexploited in most search engines. Searching using examples is natural and intuitive for many situations in which the standard text search box is too limited to express the user’s information need, or infeasible for the type of data being queried.
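One way to see this feature learning, under the Bernoulli model with a beta prior described earlier, is to look at the per-feature weight the query induces. The sketch below assumes a uniform prior (alpha = beta = 1) and uses hypothetical feature vectors; the function name `feature_weights` is ours. Features shared by all query items get a positive weight, features absent from all of them get a negative weight, and features on which the query items disagree get a weight near zero.

```python
import math


def feature_weights(query_items, alpha, beta):
    """Per-feature log-score weights induced by a query of binary vectors."""
    n = len(query_items)
    weights = []
    for j in range(len(alpha)):
        s = sum(x[j] for x in query_items)  # query items having feature j
        a_post = alpha[j] + s
        b_post = beta[j] + n - s
        weights.append(
            math.log(a_post) - math.log(alpha[j])
            - math.log(b_post) + math.log(beta[j])
        )
    return weights


# Hypothetical two-item query over three binary features:
# feature 0 is shared, feature 1 is in one item only, feature 2 is in neither.
query = [[1, 1, 0], [1, 0, 0]]
weights = feature_weights(query, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(weights)
```

Adding more items to the query sharpens these weights, which matches the observation that queries of two or more items let the system discover what the examples have in common.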

Uses

The Bayesian Sets method has been applied to diverse problem domains including: unlabelled image search using low-level features such as color, texture and visual bag-of-words; movie suggestions using the MovieLens and Netflix ratings data; music suggestions using last.fm play count and user tag data; finding researchers working on similar topics using a conference paper database; searching the UniProt protein database with features that include annotations, sequence and structure information; searching scientific literature for similar papers; and finding similar legal cases, New York Times articles and patents.

Apart from web and document search, Bayesian Sets can also be used for ad retrieval through content matching, for building suggestion systems (“if you liked this you will also like these”, which is about understanding the user’s mindset, instead of the traditional “people who liked your choice also liked these”), and for finding similar people based on profiles (e.g. for social networks, online dating, recruitment and security). All these applications illustrate the wide range of problems for which the patent-pending Bayesian Sets provides a powerful new approach to finding relevant information. Specific details of engineering features for particular applications can be provided in a separate post (or comments).

Interactive search box

An important aspect of our approach is that the search box accepts items as well as text queries; items are added and removed by dragging them in and out of the search box. An implementation using patent data is at http://www.xyggy.com/patent.php. Enter keywords (e.g., “earthquake sensor”) and items relevant to the keywords are displayed. Drag an item of interest from the results into the search box and the relevance ranking changes. When two or more items are added to the search box, the system discovers what they have in common and returns better results. Items can be toggled in and out of the search by clicking the +/- symbol, and items can be completely removed by dragging them out of the search box. Each change to an item in the search box automatically retrieves new relevant results. A future version will allow for explicit relevance feedback. Certain data sets also lend themselves to a faceted search interface, and we are working on a novel implementation in this area.

In our current implementation, items are dragged into the search box from the results list, but it is easy to see how they could be dragged from anywhere on the web or intranet.  For example, a New York Times reader could drag an article or image of interest into the search box to find other items of relevance. There is a natural affinity between an interactive search box as described and the new generation of touch devices.

Summary

Bayesian Sets demonstrates that intelligent information retrieval is possible, using a Bayesian statistical model of human learning and generalization. This approach, based on sets of items, encapsulates several novel principles. First, retrieving items based on a query can be seen as a cognitive learning problem, and we have used our understanding of human generalization to design the probabilistic framework. Second, retrieving items from large corpora requires fast algorithms, and the exact computations for the Bayesian scoring function are extremely fast. Finally, the example-based paradigm for finding coherent sets of items is a powerful new alternative and complement to traditional query-based search.

Finding relevant information from vast repositories of data has become ubiquitous in modern life.  We believe that our approach, based on cognitive principles and sound Bayesian statistics, will find many uses in business, science and society.
