A Search Engine Architecture Based on Collection Selection

>>Diego Puppin: I want to try to keep this
talk kind of non-confidential. This stuff was
research I did before Google, before I joined Google. Umm, and no [INAUDIBLE] going to be
published kind of soon. We’re trying to submit on transaction information systems. Umm, so
I would ask you not to say things that are Google
related or whatever. Because we are trying to put these on You Tube or GoogleTalks series
whatever public stuff. So if there are questions entirely related to Google maybe
we can keep that for the end when we stop recording. Umm, so Google doesn’t endorse,
maybe doesn’t like this kind of work. As I told
you this stuff is public but still maybe you didn’t see them yet, so I don’t know. [Diego
laughs] That’s why I’m here. Um, so I represent this work I did on how
to use collection selection on search engine architecture. And notes are available online
here. The URL is kind of easy to remember. OK, the motivation is what you expect so that
the web is growing and more than that it changes the way it’s used. So people are expecting
good information a lot of good information and they expect to have result
fast. The index is growing and also the data that
is stored in the index is growing, you know, there are more qualities seen as, qualities
seen as, that are using index in rankings and finding pages, you know, in the web. Umm,
and parallel, and parallel architecture is probably
the only way to manage this thing. There is no big machine that can hold the just index
in one chunk. There are several research and parallel architecture is the only way to do
it. There are billions of pages that are stored today in index and there is a need for partitioning
data this into manageable manageable chunks. So, distributed parallel
architecture, parallel a parallel information retrieval system architecture is what you
probably know. There is a front end that collects the query from the users and underneath there
are a bunch of servers that are keeping a part
of the index. And each in each index is queried and the results are merged and sent back to
the user. So this is common ya know, simple, simple architecture but the problem is over
here. So, we have an index in information retrieval system the index is a data structure
that matches documents with terms. So, every time a term appears in their web page it puts
a little ‘X’ here and we have a big match that’s
not, that’s you know, showing how terms are appearing various documents. And two main
ways to split these up into separate chunks. One is split them horizontally so one server
is responsible for a set of terms. And the other one’s this where one server is responsible
for a set of documents and knows about all the terms in these documents here. So if you
have this architecture every time you get a new
query you are gonna contact all the servers that are holding terms. So if this server
is holding terms that are not in your query this
will not be working. In this architecture your
gonna probably broadcast the query to all the servers and every servers, server, will
tell you which documents are matching your terms.
So this is the basic idea. There are different advantages and problems with these architectures.
So as I told you if you split the terms only the server with the relevant terms will
be involved in answering the query and this can
be can bring to reduce the computing load of the overall system. But actually the problem
is load balancing because some terms could be more popular or more queried or you know,
whatever, more complex to answer. Also, um, if you go into details there is a network
communication pattern between servers in term partition architectures because they’ve got
to exchange the posting list and it’s more costly.
I won’t go into detail anyway. Um, the document partition has better balancing
but as I told you all documents are queried so all servers are involved into the query.
And we want to try to see a way to reduce the
computing load with this with an architecture that’s document partitioned. So that’s goal
of this research. The main architecture ya know there are other
architecture structural ideas. There is an interface that does caching and collection
selection we will see these into details and on
the back end you still have your search cores. You’re collecting statistic data but the
query’s here. We perform the collection selection so every time we get a new query we choose
which servers are gonna be queried for the specific request. OK, so the main contributions of this research
are the three that are listed here. So the document model is new, is different, and we
represent a document not with the terms that are
appearing into the document but by the queries that are able to recall the document. And
this actually a better representation and allows for more efficient partitioning and
selection. Also there is this new concept of load driven routing so when the queries
send to the servers we also the consider the recent
load at the moment and we base our decision on
this information. And Then there is this new thing that’s called incremental caching that’s
different from basic cache that just in improves the throughput it’s also able to improve
the result quality. OK, I need to acknowledge my co-authors and
other reviewers, umm, that contributed to this
work. OK. Umm, Other contributions I probably I won’t
have time to show today are a more compact collection representation, this is more compact
than the state of the art, CORI. And it’s actually outperforming it’s results. Um, also
there is a new way to find documents that are
contributing to only a tiny fraction of the results. So about half the collection has,
in our experiment, has low quality results and
we can just send them to another supplemental index set of servers or something like that.
And also introduce a new way to update the index so to add new documents to the index.
And a contribution actually, um, it’s relevant in the academic field, and not at Google,
is that actually the test that was quite big for
this center. So it was six million documents and two million queries. It’s nothing like
things we are treating at Google but still. OK so, first of all let’s speak on how to
improve the partitions. So how to split documents into servers so we are able to offer a
better service to users. So we have a document collection and we need to we need a partition
strategy to map the documents onto several servers. And one strategy is just to do a
kind of random assignment and then broadcast the query
to each server. So just to have load balancing we spread the documents all over
randomly or round robin fashion with no control and every time we get a new query we just,
we just, broadcast it to all servers. Ya know there are other strategies like doing document
clustering based on the content or the links or stuff like that. But in the doc I’d like
to show you how to use past usage or information from the query logs to do clustering, to do
document clustering. And the way it works is as
follows: so you have a set of training queries and those can be either the top queries from
your query log or a sample or whatever. And you record every time which documents are
matching which queries. So let’s say query one is recalling document 1, 4, 8, 10 and
12 and so on. And so you’ve got all documents here,
in my case it was six millions; you’ve got all
your training queries, in my case it was about 200 thousand. And then your perform
co-clustering so OK document ‘J’ is returned by the query. And co-clustering is a very
stay is a very robust and fast algorithm that sorts
columns and rows into blocks like this. So your co-clustering is trying to aggregate
ones and zeroes in blocks. And you have a cluster in metrics that can be seen as, ya
know, three by three. You’ve got three document clusters and three
query clusters. So every time we have new query
clusters we just concatenate the queries together and we call this query dictionary and this
will be used later for collection selection. While the document clusters are set one per
server. So every server will hold one of the clusters. So the co-clustering is a very stable
algorithm I told you was presented in literature about four years ago and is able to find
the clustering that minimizes the loss of information between the regional metrics and
the cluster metrics. So it has a very strong theoretical
basis. Um, the implementation can scale up because it’s prone to be parallelized
very well and it’s very robust because you can change the test period, you can change
the number of clusters, you can change the also
the matrix model I won’t spend time on this. But across all these changes the results are
consistent. So it’s very, very, very reliable. They way we use this for collection selection
is that when we get new query we compare it against the query dictionaries we created
and we can rank the query clusters. And then we use
the result of the co-clustering to choose the
document cluster. So we a rank of each cluster. I will show you this in more detail in a
second. So let’s say this is the result of our co-clustering
process. This is some of the scores that appears in the rows and columns contributing
to this little block here. So you are summing the contribution of each document
in this cluster and each and each query from this
cluster. You sum it here you have the metrics; every time you have a new query queue you
do you do just a search of the query against
the dictionaries against the query clusters. So
imagine that you have in the case you have three query clusters and each of them is
described by the concatenation of the queries; and you look for the query into this set of
concatenations. You do, ya know, random, you do usual you do usual information retrieval
stuff like TF-IDF or Okapi 25 all standard things. Anyway, you get to rank you’re able
to rank your query collections and then you do
column by metrics and you’re able to rank the
document collections and you can order them. So let’s say the third document collection
is more promising than the other ones. So can
sort them and you say, ‘this is the first, this
is the second’ and so on. What you do now is that you start querying the servers in
that order. So every time you got a query you start
querying first the most promising servers server and so on. We did some experiments
to verify what was the quality of the results we
could find using a subset of servers, such as the most promising. The experiments was run on a set of about
6,000,00 documents and we used about 200,000 queries for training and we used about 800,000
queries for a test. So this represented three weeks. [Diego laughs] So second compared with
Google. This represented three weeks of queries in
2003 to this small search engine called; I didn’t write it? Totalbr, whatever. So anyway,
this represented a small this represented a real sample of queries to a small search
engine and we tested with queries from the subsequent
weeks. These are unique queries; and these are also repeated queries so this is why this
is smaller. The documents clust, the documents were split
into 16 document clusters and we created 128 query clusters. And we measured the intersection
that is the number of results we could find in a subset of servers related to the full
index. And also we used this more complex metrics
that actually was measured was not just measuring the intersection but the value so the
score of the results. But I mean these are these gave similar results.
This is some mathematics to describe better the quality metrics. But I mean the intersection
is just this is the metrics for intersection so we intersect the results we get from our
server set with the ground tooth, ok whatever that’s simple. Anyway, let’s look at the result one second.
So, umm, we tried different partitioning and different selection strategies and the
first one was just to use CORI. That’s the standard collection selection architecture
on a situation where documents where were randomly assigned to documents. So if we just
do collection selection we tried to query the
first server, the one that’s believed to be most promising, we get let’s say .3 documents
out of 5. If you do mathematics it’s just 1/17 it’s just random stuff. So collection selection’s not helping you
if you have random assignment. While if you do the
co-clustering as I just showed you and use CORI on the same thing you are able to have
a big improvement of about five times. So if you
do if you partition your doc if you partition your documents before hand and use CORI you’re
able to get about 1/3 of the documents that you would get for from the full index just
on 1 server out of 16. The seventeenth server in
our experiment was just the set of documents that were never find never during training.
So all the documents that queries couldn’t find
ever during the training were sent over here. And these are our, if you want, supplemental
index. And didn’t contribute much to the result. So if you look at the main set of
documents using just 1 of our 16 servers you get
1/3 of the results. So and you can choose your point, you can choose your quality by
increasing the number of servers. If you use our new collection selection strategy
you can about 10% better than CORI. This is actually out performing the state of the art. [pause] And this represents what happens. So if you
choose to use only one servers one server or two
servers or whatever for each square you select the most promising and you query just one,
two out of all your set. You have this nice this nice precision curve. So using just one
server you can get about 1/3 of the results you get from the full collection. Um, and
if you want to have, I don’t know, 2/3 of the collection
into to use about half of the servers. But actually this regions very interesting because
this means that if you have a peak load or if you have failures or you need to accommodate
a lot of peak load you’re able to give reasonable results with very, very limited
resources. But we can still improve this. Anyway
there are already some very interesting ideas in this point. So, when you do the partition
the popular queries are driving the distribution and this is good because actually popular
queries will find better results, on average. Also, as I told you, we are representing the
documents in a new way. It’s a lower dimensional space and actually it’s better when you are
also trying to do other type of clustering based on vector space. It’s a more efficient collection selection.
it’s about 1/5 smaller than CORI. Um, and also
interesting you can build this representation while answering other queries. So the search
engine could be running in the mean time and you keep collecting results. I tried to find out what were the weak points
and one could be actually be the that there is
dependents from the training set. But we changed that and it didn’t effect the results too
much. It’s true that we can not manage terms that we’ve never seen before. So when we do
the training we concatenate the queries we got
from the whatever from the users. So our training set is limited vocabulary. If you have a new
term you’re not able to find it in any query cluster. So this is actually a problem. You
can reserve, um, you can go back to CORI. CORI’s
technique they use all terms in the dictionary it’s much bigger for this reason. But
actually if you look at the average, at the ag, at the aggregate result it doesn’t prove
too much. [pause] While our incremental cache technique is actually
helping a lot in this case. And I will, I will, you will see this a lot in a moment.
Also, the collection selection depend from the
document assignment you do at the beginning of the configuration, but actually if you
add documents you don’t break performance and
this was tested. [pause] OK, so let’s see what happens with load distribution.
So this strategy has the same drawback as term partition architecture. So, you know,
some servers could be more popular than others. So there can be a big difference in
load between one collection and one other and
another. Because, for instance, this collection is holding all documents of, you know, very
popular topics and this collection is holding, you know, not so popular stuff. So there can
be a big difference and actually this is a cost to the search engine because there is
a machine that is under utilized. So what we
would like to do is force the load to the same
peak level. [pause] OK the load in this case is very approximate
measure. It’s the number of queries in a sliding window of about 1,000 queries. So
this is just an approximation. But anyway, if you
use this architecture you’re still reducing your peak load from, you know, 100% that would
be serving all queries to all servers to about 1/4. And with this limited load you’re able
to get about 1/3 of the results. So we introduced two load balancing strategies
and in simplifying these the way it works is
like that. Every time you get you get a query and we rank the document clusters for the
query we give it priority to each collection and the priority goes from one for the most
promising server down to zero. So it’s never increasing. And every server chooses to, OK
we broadcast the query and every server answer if its load is more than it’s priority times
the ya know threshold we set. So when this is one. So for the most promising server the
server ‘i’ will always answer. When this is decreasing the server ‘i’ will answer only
if its priority is, um, only if its load is the priority times the load. And this is a
variation that’s not important. But anyway, in this in this configuration a query gets
answered by the most promising servers the most, you know, the ones that were ranked
the highest and also the server at the moment are idle or under utilized. In this configuration we’re modeling the cost
as just one. So every time an ans, a quest, a
server is answering a question, a query we add one and count the number of ones. [Diego
laughs] It’s a very simple model. But we are We are
gonna work on this and we are gonna improve this
in a moment. But anyway if you look at the result with this little strategy you’re able
to push your load from the blue bar to the red
and the yellow one. The yellow is the little, you know, is the different version
I just gave quickly. So this is just a snapshot at a certain point of the load that’s being
held by any by all servers in the configuration. So if you just take a snapshot
of the of the load of your servers you’re able
to, you know, all of them are running to peak, peak, peak power. And actually the results
are improving a long are improving a lot because this is what we had before when we chose to
query let’s say four servers per query. So every this is the initial configuration. So
let’s say we have a query we find the most promising
four servers and we query the top four. Keeping the same peak load that’s about 1/4
of the, ya know, of the 100%. We can get about 10% results better just by using this strategy
that’s using idle load. So without increasing computing power this technique
is about to improve about 10% of the results. And this different version is a little more.
But at, at, at every measurative level we know
added computing load this too can improve your signif, significantly. [pause] Also, OK caching is another interesting point.
When you do collection select, OK, this some background. So result caching is used in all
the web search result web search engines because it’s able to prevent popular queries
to slow down the servers. So very popular queries are not sent to the back end and are
intercepted by the front end and just sending the results back to the user. Um, it’s interesting
also because when you have caching it’s reshaping the distribution of your queries.
And so after caching not popular queries become very popular, you know, compared to the full
set. Here we introduce a new caching system that’s
taking care of the of the collection selection. Um, there is a drawback when you
just do caching with strategy. So if you get if
you have a query and you do collection selection maybe only one or two servers are actually
answering your query. So if you cache you are caching only a part of the result. And
if you are serving this information back to the user
the user will find degraded result. So what we
tried to do with incremental caching is that every time we got the same query we’re gonna
add new results to the cache. So over time popular queries will have a full set of results.
So this is interesting because repeated queries will have the full set of results. And they
will see no negative effect of collection selection. [pause] OK, so what what’s happening now is that the
incremental cache is, yes, is reducing the reducing the load to the backend because it’s
intercepting a part of the queries. But it’s also improving the result quality at the same
time. [pause] OK, and this is the new set of results if
add incremental caching to, to you know, load-driven routing. What happens is that
you’re able to move from let’s say 3.5 results over the five over the top five. You’re able
to get four results out of five. Using the same
load you would have to guarantee that just four servers queried every time. So obviously
that’s a different way. If you configure your system so that you’re able to query at least
to guarantee at least four server per query. With the same peak load you’re able to get
about, ya know, four result out of five. And this is true more or less with different
metrics; and different configurations. So if you change the cache size, if you’re using
a static cache, if you’re changing the training
week the test week. Across all different configuration actually adding no load you
can get you can get very good results. So, before we were using a simple model cost
where a query cost just one but actually we refined these and we used real timing for
the queries; and the results were confirmed. So we
changed the simulation infrastructure so at this time the servers determined their own
their own load measured, for instance, their computing time to our answer a query; and
they sum up that over time; and still they answer
if their load is smaller than the priority of
the query time the, ya know, their computing power. So again the broker, the interface,
the front end is ranking the collections and is giving different priorities to each
collection for the query. So it’s broadcasting the query to all the servers with a tagged
priority. Any server will choose to answer if the instant load is more than the priority
times the peak load; and so if the if the priority’s one for the most relevant server
this will always answer; and for less relevant
servers they will answer only if they are idle or,
you know, under utilized. [pause] And actually it’s important to choose a cost
that’s not simply one because the cost varies a
lot from milliseconds to seconds. So this was actually on, ya know, on the machine at
Miami University, you know, there were other applications
running, ya know, all that stuff. But still it was this was a gross a gross . Anyway so we actually partitioned the documents
onto some different servers. Every server had the local index so every server was able
to answer queries for the documents it was holding; and we measured the timing of each
query. So actually the real timing was just to
compute the load and drive the system. There was a little -a little change here. So before
we measured the peak load of the system and that was our [cough] I’m sorry. That was our
threshold for -for um, for dropping queries; but actually if you take the sum of the load
the peak is very high so actually we did the average load. I mean this is this was a small
correction. Anyway the results were confirmed. So um, actually we’re OK even we’re
confirming actually that are even better because if you’re using collection select, what is
this one, yeah. So if you’re using load-driven strategy and collection selection you’re able
to guarantee about 2/3 of the results you would get from the full index by using the
computing load that’s guarantee you the first the first most author authority server. So
if you are able to tune your system so that at
least one server is guaranteed to be able to
answer to the query. Using the idle load and the caching you’re able to get about 2/3 of
the results with no added computing pressure. [pause] I also did so results on a bigger load. On
a bigger, I’m sorry, query logs. This is actually
2,000,000 queries and results are more or less confirmed. [pause] OK, so this concludes, this was pretty fast.
So there is time for questions. Um, this is what I tried to present an architecture that’s
using heavily collection selection. So documents are partitioned onto several servers
and only this set of servers is used to answer. This load-driven strategy with the
incremental caching can get actually very high
result; and reducing heavily the computing pressure on the system. And it was verified
with different simulations and different configurations. So the way it’s going to be used is like this:
if you are if you’re trying to reach a given precision you can, you know, use fewer servers;
and if you are given a limited number of servers you’re try, you know you can get better
precision; and this was confirmed with different metrics. So I this actually is showing
how to, ya know, this is showing the trade off between cost and quality; and it’s
boosting the, ya know, focus on top results at the expense of low quality results. But
it can be useful if you are serving a very collection of data and you’re, ya know, and
you can afford to lose some results. Also this can be interesting because the load-driven
routing can be used dynamically. So if there is a peak load or if some machines are failing
or whatever you can use this strategy to reduce the computing load to accommodate more users,
more queries. [coughs] Another interesting thing is that still while
even if you are using collection selection and you’re using servers independently there
are ways to have consistent ranking. So when you get results and you merge them out to
the servers the results are actually consist with what we what you would get with a with
a full centralized index. And at the end I was telling you briefly before
incremental caching is actually good also to reduce the negative effects of selection.
So if you have terms that you have never seen before and you are not able to do a good collection
selection but if that query’s for some reason popular. So there is a new term that becomes
suddenly popular like, oh I don’t know, iPhone or whatever. What’s new today? Anyway, so
if you have something new and everybody’s asking about that the query will get populated
and you will get full correct results for the new term. There are other things that are very interesting.
There are very good appli, there are some more potential, there is some more potential.
So. If you are doing posting list on the local index>>Male audience voice: Sorry, what’s a posting
list?>>Diego Puppin: The posting list is the line
in the if you have your document and term metrics the posting list is the line. So it’s
the line for every term that gives you all the documents more or less. So if you doing
if you’re caching terms on the local index you will get some more focused queries because
only queries that are relevant to that server will get there; and so actually the caching
will have less, you know, there will less noise for the cache; for the local cache in
the server. Also there is a way to add new documents that’s
not, um that’s not breaking performance. So there is a way to add documents to the collection
without breaking the performance. And actually it’s very important because you wanna be sure
that when the collections are changing that your collection selection function is still
working well. And at the end there are ways you can improve the caching. Not just by using,
I don’t know, list reason to use algorithms. LRU algorithms for the cache. But for instance
you could use information based on query frequency or maybe the number of servers you already
polled. So if you have a cache line that’s fat because you you’ve queried a lot of servers
maybe you want to keep that over time instead of swapping it out. And that concludes, I’ve got some backup slides
if there are some questions on, you know, the training on adding documents; and otherwise
we can just call it a day. Thanks for coming. [applause]

One Comment

Add a Comment

Your email address will not be published. Required fields are marked *