3-1-1-11-web-search-and-information-retrieval.mp4

Hello, folks,
this is Geoffrey Fox again, and we're doing the Big Data
Applications and Analytics course, Data Science Curriculum,
School of Informatics and Computing, Indiana University, and
we're in Lesson 12 of the Course Motivation, which is Unit 2 of
this 33-unit course. This, I think, is the longest unit, because it covers everything at
a very superficial level, and now we're going to look at web search
and information retrieval as one of the major commercial
examples. Remember, so far we've done three major
examples: physics, recommender engines for e-commerce,
and movie and media sites. Now we do the other major
commercial area, web search. The ones we haven't discussed are social network sites like Facebook and
things like that. Twitter actually has some
relationship to this, because it's pretty much text. All right, one of the types
of things involved if you were doing the analytics for
a web search: well, first you need to get the data. You either
have to gather it from the web, or you have to scan it all
in, like Google Books. That means you need to
crawl around the whole web, and that, as far as I know, is a nontrivial
but effectively solved problem. Of course, you have to do
it in a responsible fashion: some sites will tell you not to crawl them, and you have to observe their
interests and so on. Anyway, it requires a lot
of robots crawling out there on the web, gathering data from sites,
and bringing it back.
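Responsible crawling can be sketched very simply: before a robot fetches a page, it checks the site's robots.txt, which is where a site expresses the interests just mentioned. Here is a minimal sketch using only Python's standard library; the crawler name and URLs are made up:

```python
from urllib import robotparser

# Parse a robots.txt body directly (no network needed for this demo)
# and ask whether a given crawler may fetch a particular path.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("my-crawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("my-crawler", "https://example.com/public/page"))   # True
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to download each site's live robots.txt before fetching its pages.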
When you get the data, you run it through preprocessing,
which does two things. It looks, of course, for new links;
for each new link, it checks whether you already have that data,
and if not, goes out and gets it. Of course, you also have to
update your data continuously. And when you
preprocess the data, you look for the things that people are going
to query on: words and their positions, so that you can handle
phrases, which is why we record the positions. From the words, you form
what's called the inverted index, which maps words to documents,
so that when you query for a particular word, you immediately
get a list of the documents that contain it.
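The inverted index just described, mapping each word to the documents (and positions, so phrases can be matched) that contain it, can be sketched in a few lines of Python; the toy documents are made up:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

docs = {1: "big data analytics", 2: "web search and big data"}
index = build_index(docs)
print(index["big"])  # [(1, 0), (2, 3)]
```

A query for a word is then just a dictionary lookup, and a phrase query checks for consecutive positions within the same document.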
You quantify the value
of that word match, and that has two or
three important aspects. One is the classic information
retrieval measure called TF-IDF. TF is term frequency; IDF is inverse document frequency.
Together they rank documents by how often a word appears and
by how informative that word is, and that's used to quantify
the importance of the word match. The inverse document frequency reflects how many documents
a particular word appears in; TF is how many times the word appears
in the particular document.
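Here is a hand-rolled sketch of the TF-IDF idea just described: TF counts how many times the term appears in one document, IDF downweights terms that appear in many documents. This is the textbook formulation; real engines use many variants.

```python
import math

def tf_idf(term, doc, docs):
    tf = doc.count(term)                           # term frequency in this doc
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(len(docs) / df) if df else 0.0  # inverse document frequency
    return tf * idf

docs = [["big", "data"], ["big", "search"], ["web", "search"]]
# "data" appears once, in 1 of 3 documents -> score = 1 * log(3)
print(tf_idf("data", docs[0], docs))
```

Note that a word like "big", appearing in two of the three documents, scores lower (log(3/2) per occurrence) even with the same term frequency.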
Then we try to look at the relevance
of documents, which is where things like PageRank come in; PageRank effectively
counts how often, and in what fashion, particular
pages are pointed to with links. Of course, in principle you could
also use data you probably have on how many users access each site. Then we have lots of advertising
technology, because you want to do the best job you can at supporting adverts, since adverts
are how you get your money. Remember, I told you that at the
beginning of all this search stuff, nobody knew how to make money; but then people realized
it was just advertising. On the other side, people do a lot of
reverse engineering of search, so that they make their sites very
attractive to search engines and get lots of people
coming to their sites, which gets them lots of money from
the advertisers. And of course, the search engines
work to prevent that reverse engineering and
penalize you for doing it. Then we need to do more
sophisticated things, like the latent
factors and topic analysis I mentioned earlier, which are typified by Google News and other
such things that aggregate data, grouping websites in
particular according to how they are related to each other. It's effectively
a generalized clustering. All right, so that's web analytics. It's pretty well understood now and
relatively mature, except for subtle things like advertising. This particular picture is
a well-known one from Wikipedia, in which each
happy face is a webpage; here's a yellow webpage and
a green webpage. A finger pointing means there
is a URL link from one page to another. You will see that more
fingers point at the yellow page than at the others. PageRank measures not only
the number of finger points, but weights each finger point
by the importance of the page it comes from. So the yellow page has
the most finger points; this big red one has only
one point, from the yellow page, but the yellow page is such
an important page that the fact that it points to
the red one makes the red one important too. Anyway, you will see
various sizes here; each is just proportional
to the PageRank for this little collection of 1, 2, 3, 4, 5, 6... 11 websites. PageRank is a very well understood
technology and, of course, is implemented today,
probably in a very sophisticated fashion, hidden deep inside Google,
Microsoft Bing, and Yahoo, who will never tell you exactly how,
but they really do use it.
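The idea in the picture, that a page's rank is the weighted sum of the "finger points" it receives, can be sketched as a small power iteration. The toy link graph and the damping factor of 0.85 are the standard textbook setup, not anyone's production system:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank; links[p] lists the pages p points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share       # p passes importance to q
            else:
                for q in pages:           # dangling page: spread evenly
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C: pointed to by both A and B
```

Note how the weighting works: C ends up slightly ahead of A even though A also receives a link, because C collects importance from two pages while A collects it from only one.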
All right, so now we want to look at the application:
recommender systems for web pages. This comes from a Yahoo tutorial. I told you, everything in life is
an optimization problem. So when we're designing web
pages, especially for portals or news pages,
we're effectively trying to optimize money. You're also
trying to optimize getting information to people, but
that part is straightforward: you just put the information
on the page. What we're really trying to optimize is money, and that money goes to various parties:
there is somebody who owns the website, but if that website
actually lets you buy something, it also brings money to the person selling it. So you have to serve
the right item to a user, in the given context, to optimize your
long-term business objectives. And you need to do it in a way
that makes the user happy, so that you recommend
sensible things, because you want
the user to come back; then they will
spend more money and so on. So, this is a scientific discipline
and it has lots of things built into it. There's large-scale machine learning, which is closely related
to statistics. Then you have online models:
real-time models, streaming models. And you have batch models, which
look at the overall structure; there's a difference between streaming
clustering and batch clustering. You need to do this in
an exploratory fashion, and you need to do testing to
find out how to do it. And everything has to be,
in principle, dynamic: the page needs to be
constructed in real time, according to exactly what you
know at this very moment in time.
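The streaming-versus-batch distinction can be illustrated with the simplest possible model, a running mean: a streaming model updates its estimate one observation at a time, in a single pass with constant memory, while a batch model recomputes from all the data at once. The numbers are made up:

```python
def streaming_mean(stream):
    """Update the mean one observation at a time (one pass, O(1) memory)."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # incremental update
    return mean

data = [4.0, 8.0, 6.0, 2.0]
batch_mean = sum(data) / len(data)            # batch: needs all data at once
print(streaming_mean(iter(data)), batch_mean)  # both give 5.0
```

Streaming clustering algorithms apply the same incremental-update idea to cluster centroids, which is why their results can differ from a batch clustering over the full dataset.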
And you have lots of objectives, because up here we listed
lots of objectives: click rates; engagement,
which is how long users stay (there's no point if
users click once and never come back); the money you get; and diversity, so that you don't keep recommending
the same thing, and so on. We need to try to understand
who the user is: we need to effectively compute user profiles,
and you'll see how valuable those are. Google now knows so much
about me that it's always making good recommendations, better
recommendations than it used to, because it has a much
better understanding of me; and of course, that's how all
of these modern systems work. And then we need to
intelligently interpret the information and things we get, which is natural
language processing. That allows us to look at
things like breaking news and topics, and to find things
that are relevant, and so on. Here we're looking at a page
where we need to apply these principles. It's a typical
portal page with advertisements and a rich set of lists of different
types of things, and we have to design
what to put where, according to this
multi-objective optimization. And then,
if we go through the animation, we can see that here
we're recommending applications: that's this set here. Here, we're recommending
search queries. Here, we're recommending packages, each with an image, a title, a
summary, and a link to other pages; here we have links to other
pages about octopuses, and related concepts, such as winners or losers, which they think I
might be interested in. If we go further,
we can see that in this area
here we have 4 items; those 4 items are chosen
out of a group of around 20, because typically you
don't want to show too many. That number, the total number shown, is
chosen on the basis of experience, and it's all done dynamically,
again, to help make certain I'm happy and come back to this page, because I find Yahoo very
satisfactory and so on. And this is all routing traffic to
other pages: these things are linked to other pages, and those other
pages are going to be very grateful, and they might even pay money, if I click on this octopus
picture and go and buy a car, or whatever that octopus picture
is trying to make me do. So, that's pretty important. So, what am I looking at in this content?
In the simplest case, I have a content module and
an inventory of possible content. I can introduce some bias
toward a certain type of content, because I need to
recommend content in some fashion that improves the overall
rate of clicking on this module, so that people are genuinely
interested in it and the clicks improve. But I also have additional
information downstream, as users click further through, and I want to increase the value
of those downstream utilities: there's no point in people clicking
if they're not going to go and buy the car at the car dealership. So we need to make certain
we attract the right clicks. And I need to look at all of this
simultaneously: the previous page had lots of different components,
and we need to optimize those components together, as a single overall
multi-objective optimization. So, it's all pretty non-trivial.
