When Federated Search Bites

My apologies, I am probably going to step on some toes.

First, let me explain what I mean by federated search. Federated search: conducting a search against “n” source systems via a broadcast mechanism without the benefit or guidance of an index. This is somewhat like roaming the three buildings of the Library of Congress looking for a book title … without benefit of a card catalog.

I am speaking specifically about environments where the systems in the federation are heterogeneous, are physically dispersed, were not engineered for federation a priori, and are not managed by a common command and control system.

By way of example, an airline might have a payroll system containing employees, a reservation system containing flight reservations, and a watch list database containing people who are not permitted to fly. If this airline implemented federated search, the data in these three systems would remain in these three systems. Searches (whether invoked by users or machines) are then broadcast to each source system. Note: Source systems receive queries for information they may or may not have and, as we shall see, receive queries for data they may have but have no means to locate in any efficient manner.

Federated search works fine if the goal is simply a reference system used to answer periodic inquiry. Such systems could be described as forensic in nature – when there is something of interest, one can look for it. Think of such federated search environments as systems where “the data only speaks when spoken to.” If this is what an organization needs, and there are a small number of queries and a finite number of source systems, federated search is a fine option.

Most organizations are not living in a world where “after-the-fact forensic discovery delivered only when asked” is acceptable.

Most organizations have some obligation to make sense of what they know. For example, the airline should know if the person added to the watch list is already an employee or already has a flight reservation. Ideally, the moment such facts become knowable, someone or some system should be notified. Think of this as “the data speaks to itself.” I call this data finds data.

This notion of data finds data implies the “data is the query.” As each new piece of data enters the organization, the organization has just learned something. And it is at this exact moment in time that one (a smart system) must ask: Now that I know this, how does this relate to what I already know? Does this matter, and if so … to whom?

Whether the data is the query (generated by systems likely at high volumes) or the user invokes a query (by comparison likely lower volumes), there is no difference. In both cases, this is simply a need for “discoverability” – the ability to discover if the enterprise has any related information.

If discoverability across a federation of disparate systems is the goal, federated search does not scale, in any practical way, for any amount of money. Period. It is so essential that folks understand this before they run off wasting millions of dollars on fairytale stories backed up by a few math guys with a new vision who have never done it before.

I will spare you the gory details of that day in 1996 when I came to witness such a federated search system. Multimillion-dollar, very smart middleware, developed over a number of years, was sitting atop a reported 2,000 data stores and 50B rows of data. Watching this large federated search system really drove home a series of epiphanies about the problems of federated search. Fortunately, the purpose of this particular system was a reference/forensic system that only had to respond to a relatively low volume of queries, primarily generated by users. And getting an incomplete answer from time to time would not be the end of the world.

To explain why federated search bites I will lay out three basic goals, three notional source systems, and four nasty problems (let’s call them challenges). Mind you, the greater the number of source systems, and the greater the transactional volumes, the more impossible it becomes to discover similar data across dissimilar systems (data finds data).

GOALS

Goal 1: Because the data must find the data, for every record added or updated in the federation one must determine if this information is related to any other records in the federation. Such discoverability must be able to keep up with transactional volumes and therefore must be near-real-time. [Note: To keep this really simple let us say related only means: shares an exact passport number, address, or phone number.]

Goal 2: Users should be able to pose queries themselves. Although, as it turns out, this goal does not matter because the discoverability properties needed to deliver on Goal 1 can just as easily be applied to this goal.

Goal 3: The federated search system must be scalable across hundreds or more disparate source systems. As such, new source systems must be able to be added to the federation without adverse consequence to existing source systems in the federation. Otherwise, the greater the number of systems, the more unmanageable the environment.

SYSTEMS

Using the airline example, let’s say the three notional systems look like this:

System 1: A commercial-off-the-shelf payroll system (20K employees, <16 CPUs, 200 transactions a day (subject to data finds data), system running at 90% utilization).

System 2: An airline reservation system (100M reservations, <265 CPUs, 2,000 transactions a second, system running at 97% utilization).

System 3: A watch list database (subjects of interest) running on a commercial-off-the-shelf SQL database (1M records, <8 CPUs, 1,000 changes a day, system running at 80% utilization).

CHALLENGES

Challenge 1: How will a new watch list record containing a passport number (in System 3) efficiently locate related reservation records (in System 2) which share the same passport number? Here is the problem: An airline reservation system is typically designed to search on things like reservation number, or flight number and date of departure, not passport number. Source systems are optimized for their purpose – maintaining only the necessary indexes. And, if by chance passport number is an indexed and searchable field in the airline system, are the addresses and phone numbers indexed as well? And what about the key values in unstructured comment fields? Due to this issue, federated search can produce incomplete results because a source system may contain related records but cannot find them. Note: It is not practical to re-engineer every source system to maintain all conceivable indexes.
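To make Challenge 1 concrete, here is a minimal sketch of a source system that only indexes what it needs for its own job. Everything in it is hypothetical: the ReservationStore class, its field names, and the 10,000 toy rows are assumptions made up for illustration, not a description of any real reservation system. The point is only the gap between a lookup on an indexed key and a lookup on a field nobody indexed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    reservation_no: str
    flight_no: str
    passport_no: str


class ReservationStore:
    """Hypothetical source system: indexed only for its own purpose."""

    def __init__(self, rows):
        self.rows = list(rows)
        # The only index the system maintains is on reservation number.
        self.by_reservation_no = {r.reservation_no: r for r in self.rows}
        self.rows_examined = 0

    def find_by_reservation_no(self, reservation_no):
        # Indexed lookup: a single probe, regardless of table size.
        self.rows_examined += 1
        row = self.by_reservation_no.get(reservation_no)
        return [row] if row else []

    def find_by_passport_no(self, passport_no):
        # No index on passport number: every row must be examined.
        matches = []
        for row in self.rows:
            self.rows_examined += 1
            if row.passport_no == passport_no:
                matches.append(row)
        return matches


store = ReservationStore(
    Reservation(f"R{i}", f"F{i % 50}", f"P{i}") for i in range(10_000)
)

store.find_by_reservation_no("R42")
indexed_cost = store.rows_examined  # rows touched by the indexed lookup

store.rows_examined = 0
store.find_by_passport_no("P42")
scan_cost = store.rows_examined  # rows touched by the unindexed lookup
```

In this toy the indexed lookup touches 1 row and the passport lookup touches all 10,000. Scale the table to the 100M reservations of System 2 and the unindexed query becomes a full scan on a machine that is already 97% busy.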

Challenge 2: How will the payroll system (System 1) keep up with the flood of queries generated by the reservation system (System 2)? Here is the problem: The payroll system does not have the compute resources to sustain thousands of queries a second; it was not designed for that. Now maybe you are thinking: why would you do that? Well, data finds data is used to construct context (determine what one knows) in order to determine the right course of action. In this oversimplified example, maybe the airline likes to know when current or former employees make reservations so the right offers are made. Maybe terminated employees are not provided the same kind of offers as other former employees. Note: It is not practical to re-host the hardware of every source system such that it will be able to sustain the cumulative transactional volume of the federation.
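The arithmetic behind Challenge 2 can be sketched in a few lines. The transaction rates are the ones given for the three notional systems; the rest is my simplifying assumption: each transaction in one system is broadcast as exactly one query to every other system, with no recursion and no batching. The function name inbound_query_rate is made up for this example.

```python
SECONDS_PER_DAY = 86_400

# Native transaction rates from the three notional systems, per second.
native_rate = {
    "payroll": 200 / SECONDS_PER_DAY,       # 200 transactions a day
    "reservations": 2_000.0,                # 2,000 transactions a second
    "watch_list": 1_000 / SECONDS_PER_DAY,  # 1,000 changes a day
}


def inbound_query_rate(target, rates):
    """Queries per second arriving at `target` if every transaction in
    every *other* system is broadcast to it as one query (assumption)."""
    return sum(rate for name, rate in rates.items() if name != target)


payroll_inbound = inbound_query_rate("payroll", native_rate)

# How many times larger the broadcast load is than the payroll
# system's own native workload.
amplification = payroll_inbound / native_rate["payroll"]
```

Under these assumptions the payroll system, built for 200 transactions a day, would have to answer a little over 2,000 queries a second, which is roughly 864,000 times its native workload. And that is with only three systems in the federation.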

Challenge 3: New information can be located during the federated search that warrants a re-query of the source systems. This is recursive. Imagine the query is for a passport number that exists only in the watch list database. But what if the watch list database contains a matching record which reveals a new phone number? This newly discovered information, ideally, must be used to re-query the federated systems. For example, maybe there is a record in the reservation system with the same phone number and maybe this reservation contains a new address! Here is the problem: With each new feature discovered one must consider re-querying the source systems (again). Note: The hardware at each source system would not only have to support the transactional volume of the federation – but the recursive queries on top of that.
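Here is a rough sketch of the recursion in Challenge 3, using the same passport-to-phone-to-address chain. The three toy record sets, the search helper, and federated_discover are all hypothetical stand-ins; a real federation would be sending these queries over a network to systems it does not control.

```python
from collections import deque

# Toy federation: each system holds records made of (feature -> value).
SYSTEMS = {
    "payroll": [{"passport": "P9", "address": "9 Elm St"}],
    "reservations": [{"phone": "555-0100", "address": "1 Main St"}],
    "watch_list": [{"passport": "P1", "phone": "555-0100"}],
}


def search(system, feature):
    """Stand-in for one broadcast query against one source system."""
    key, value = feature
    return [rec for rec in SYSTEMS[system] if rec.get(key) == value]


def federated_discover(seed):
    """Broadcast the seed feature, then re-broadcast every newly
    discovered feature until nothing new turns up."""
    seen = {seed}
    pending = deque([seed])
    queries_sent = 0
    while pending:
        feature = pending.popleft()
        for system in SYSTEMS:  # every feature goes to every system
            queries_sent += 1
            for record in search(system, feature):
                for new_feature in record.items():
                    if new_feature not in seen:
                        seen.add(new_feature)
                        pending.append(new_feature)
    return seen, queries_sent


features, queries_sent = federated_discover(("passport", "P1"))
```

One logical question about passport P1 turns into 9 source-system queries here: 3 for the passport, 3 for the phone number it revealed, and 3 for the address the phone number revealed. The count is (features discovered) times (systems in the federation), and both numbers grow in the real world.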

Challenge 4: Can you be sure all systems, across all the time zones, are all on-line, all at the same time? What if the fourth system added to the federation is a small desktop application running a Microsoft Access database – will this system be left on-line at night and have a high-availability failover system standing by? The issue is: Heterogeneous systems have non-uniform availability.
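A back-of-the-envelope way to look at Challenge 4, under two assumptions of mine that are not from any real deployment: every source system is independently available 99% of the time, and a broadcast query is only complete if every system answers. The function name is made up for this sketch.

```python
def probability_all_online(per_system_availability, system_count):
    """Chance that every system is up at the same moment, assuming
    independent failures and identical availability (both assumptions)."""
    return per_system_availability ** system_count


three_systems = probability_all_online(0.99, 3)
hundred_systems = probability_all_online(0.99, 100)
```

With three systems the odds of a complete answer are about 97%. With a hundred systems they drop to roughly 37%, and that is before the desktop database that gets switched off at night joins the federation.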

[Theatrical pause]

Just how sure am I that federated search cannot handle discoverability at scale? How about this: First person to describe a scalable federated search system that delivers on the goals and overcomes these technical challenges … without having to re-host source system hardware … I’ll write you a personal check for $25,000 (see small print below).

So, if federated search is not the ideal approach for discoverability at scale, then what is?

Discovery at scale is best solved with some form of central directories or indexes. That is how Google does it (queries hit the Google indexes which return pointers). That is how the DNS works (queries hit a hierarchical set of directories which return pointers). And this is how people locate books at the library (the card catalog is used to reveal pointers to books).

Once a directory reveals a pointer, you can go fetch it. Federated fetch does scale. Yes, the source system will have to be on-line, in the same way the floor at the library must be open. Yes, the user will have to have access privileges. And yes, there are other challenges like the need to keep the directory current and semantically reconciled (to overcome the recursive issues described in Challenge 3). But, at least these are all tractable problems!
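A minimal sketch of the directory-plus-fetch idea, assuming the simple definition of related from Goal 1 (an exact shared passport number, address, or phone number). The Directory class, the SOURCES stand-in, and the record identifiers are all hypothetical. It deliberately skips the hard parts mentioned above: keeping the directory current and semantically reconciled.

```python
from collections import defaultdict


class Directory:
    """Hypothetical central index: (feature, value) -> record pointers."""

    def __init__(self):
        self._index = defaultdict(set)

    def publish(self, system, record_id, features):
        """Register a record's features and return pointers to records
        that already share any of them (the data is the query)."""
        pointer = (system, record_id)
        related = set()
        for feature in features.items():
            related |= self._index[feature]
            self._index[feature].add(pointer)
        related.discard(pointer)
        return related

    def lookup(self, key, value):
        """User-invoked query: the same index, the same pointers."""
        return set(self._index.get((key, value), ()))


# Stand-in for the source systems, which keep the actual data.
SOURCES = {
    "reservations": {"R1": {"passport": "P1", "phone": "555-0100"}},
    "payroll": {"E7": {"phone": "555-0199"}},
    "watch_list": {"W3": {"passport": "P1"}},
}


def fetch(pointer):
    """Federated fetch: one primary-key read against one system."""
    system, record_id = pointer
    return SOURCES[system][record_id]


directory = Directory()
directory.publish("reservations", "R1", SOURCES["reservations"]["R1"])
directory.publish("payroll", "E7", SOURCES["payroll"]["E7"])

# A new watch list record arrives and immediately finds the reservation.
hits = directory.publish("watch_list", "W3", SOURCES["watch_list"]["W3"])
records = [fetch(p) for p in hits]
```

The new watch list record costs one directory probe per feature and one fetch per hit. The payroll system is never touched, because the directory already knows it has nothing related.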

Truthfully, I would love to be proven wrong here for a variety of reasons, e.g., the privacy ramifications of having large centralized database directories. On the brighter side, though, the directory approach to discoverability results in fewer copies of the data floating around. And another plus may be that data governance (accountability, oversight, immutable audit logs, etc.) is going to be vastly easier to manage with a smaller number of central directories.

[Small Print: Offer good for two years from the date of this posting. If you have a solution in mind, no need to physically prove it; just explain it on paper in plain English such that the average propeller-head can read it and go “oh yeah, that would work.” But don’t spend too much time on this, as it’s obviously not a fair challenge. I’m just trying to make a point, as it seems a number of organizations, each desperate to quickly solve large-scale discoverability, are being sold on the notion of federated search. An absolute waste of money.]