Information Organization and Retrieval
Information Organization and Retrieval INFO 202
Popular in Course
Popular in Information
This 82 page Class Notes was uploaded by Judy Hegmann on Thursday October 22, 2015. The Class Notes belongs to INFO 202 at University of California - Berkeley taught by R. Glushko in Fall. Since its upload, it has received 11 views. For similar materials see /class/226680/info-202-university-of-california-berkeley in Information at University of California - Berkeley.
Reviews for Information Organization and Retrieval
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/22/15
2 How to Think About Information INFO 202 29 August 2007 Bob Glushko 10f36 Plan for INFO 202 Lecture 2 Defining Information Models and Modeling quotInformation Architecturequot 20f36 What is Information Many different ways to define it depends on your discipline or perspective or problem A common sense is quotsomething you don39t knowquot or quotnews about somethingquot But news to one person might be quotold newsquot to someone who already knows it So if something happens that we expected we get less information than if something unexpected happens 3of36 Weinberger 40f36 Svenonius quotSomething retrieved or obtained through informingquot quotThe content of a messagequot 50f36 Bates Information is quotthe pattern of organization of matter and energyquot So this exists independently of any observer or recipient but each observer can construct store and act upon it in different subjective ways Environmental and evolutionary factors influence this so there are similarities in how information is experienced Anything human beings interact with or observe can be a source of information Living beings can assign meaning to information but patterns of organization of matter and energy are not inherently meaningful 60f36 Bates39 Fundamental Forms of Information Represented no label no 7 of36 Taylor39s Information Hierarchy Taylor presents an information hierarchy of DATA INFORMATION KNOWLEDGE and WISDOM as quotdifferent levels of comprehension of symbolsquot DATA INFORMATION KNOWLEDGE WISDOM 80f36 Information Hierarchy Process Model Boisot amp Canals 2004 Stimuli Data I 3 Inlovmation Agent Knowledge I I 3 39 Menlal Models I WORLD 39 t gt Values I 39Ferceplual Conceplual lt 39 Fiuers Fmers l Actions I n 9of36 Information Hierarchy Example DATA the value we collect from a measuring instrument is 25 INFORMATION the instrument is a refractometer which measures the sugar content of grapes on the Brix scale KNOWLEDGE the sugar content of grapes predicts the potential alcohol content of wine fermented from them WISDOM If the summer weather is cooler than usual grapes will contain more acid so growers usually pick later with a higher Brix to help balance the tart flavors from the acid Is the Encyclopedia Britannica wisdom ls Wikipedia What about this encyclopedia 10 of36 Reddy The Conduit Metaphor 1 A B 7777777777777777777777777777777777 llof36 The Conduit Metaphor 2 Language functions like a conduit transferring thoughts from one person to another In writing and speaking people insert their thoughts or feelings in the words Words accomplish the transfer by containing the thoughts or feelings and conveying them to others In listening or reading people extract the thoughts and feelings once again from the words 12 of36 Conduit Metaphor Examples It39s hard to get that idea across to him I gave you that idea His words carry little meaning It39s difficult to put my ideas into words That39s not what I got out ofwhat he said 13 of36 Shannon quotInformation Theoryquot 1948 A definition of information that involves neither meaning nor people was proposed by Claude Shannon considering information as the amount of bits that a communication channel can carry Shannon39s Information Theory treats communication as producing the same pattern of bits at the destination as that sent from the source This makes communication a statistical problem and communication failures result from quotnoisequot on the channel that distorts the message The amount of information in a message is computed in terms of the probability that its bit parts occur 14 of36 Shannon Over Space and Over Time Noise Encoding Decoding Source Writing Storage Retrievall Destination indexing Reading 15 of36 Bits vs Atoms Another way to think about information that is a descendant of Shannon is in the contrast between quotbitsquot and quotatomsquot first expressed by Nicholas Negroponte of the MIT Media Lab Information encoded as bits can move several orders of magnitude faster than atoms can It can be in many places at once broadcasting networking unlike atoms that have to be in one place 16 of36 Digital Information Many distinctions between quottypes of informationquot disappear when we look past their tangible renditions or representations Rather than the appearance the model that describes the structure and content of the information becomes paramount But another twist for multimedia information most techniques for quotinterpretingquot multimedia don39t capture or process it at a semantic level 17 of36 Information Quality Do you know the source of the information Is the source authentic Is the information authoritative or credible 18 of36 Information News Uncertainty and Truth 25 r I just feel fortunate to liv much disinformation at my ngertips Instances and Classes Categories We can treat information as a description of or as being attached to an INSTANCE of something a particular realization or implementation or occurrence Or we can treat information as a description of a CLASS or CATEGORY of things that we are treating as equivalent It is important to be precise about which descriptive perspective you39re taking 20 of 36 Information Content vs Container We say quotthe document is about the photograph is about the movie is aboutquot We39re expressing a distinction between information as conceptual or as content and the physical container or medium format or technology in which the information is conveyed But Svenonius defines quotdocumentquot as quotan informationbearing message in recorded formquot which includes the container 21 of36 Another Kind of quotMessage Containerquot Abstracting Away the Container It is very useful to think abstractly about quotinformation contentquot or quotinformation flowsquot without making any assumptions or statements about the quotpresentationquot or quotrenditionquot or quotimplementationquot And we can apply abstract patterns of information exchange to new domains or technologies Abstract patterns of information content or structure are sometimes called architectures 23 of 36 Not All Containers Are Equally Separable Valuable or Usable Not all information content can be separated from its container sometimes the medium is the message But it is important to think of the information content abstractly if you can because that39s the key to representing the same information in multiple formats media or technologies Some information formats or representations are inherently more reusable or adaptable than others 24 of 36 25 of 36 quotInformation IQquot Unstructured electronic text ASCII Printed text more processability l reusability 01 elesueJ1 01491323 Models and Modeling 1 Separating information content from the structure and presentation in which it occurs in quotartifactsquot is a core skill in all the overlapping disciplines in IO This is a key activity in what is called data document information conceptual MODELING Models are simpli ed descriptions of a subject that abstract from its complexity to emphasize some features or characteristics while intentionally deemphasizing others 26 of 36 Models and Modeling 2 A model can represent a human activity a natural system or a designed system We can model structures objects their characteristics their static relationships with each other like hierarchy and reference We can model functions processes behaviors dynamic activities that create and affect structures 27 of 36 An Everyday Model The Recipe A recipe describes both objects and structures ingredients and the processes instructions for creating a food dish You can communicate a recipe to someone else who can then create the same dish You can use the recipe as a guide for experimentation with the objects or the processes in the recipe to create alternative dishes A set of recipes may contain common or reusable objects eg flaky pastry or processes stirfry 28 of 36 Not A Model of Wonton Soup A Model of Wonton Soup 30 of 36 Soug 4 cups chicken broth 2 tsp lite soy sauce 1 ts sesame oil 4 tsp hot chili paste Wontons 12 lb ground chicken veal or lean pork 2 green onions nely chopped 2 tsp grated fresh ginger root fa an lifrazh39 fground blaczpepper cu s re e res spinac 1 tSP COFnStal39Ch 00m our or watercress 1 tsp lite soy sauce 6 drops hot pepper sauce 18 wonton wrappers 1 green onion thinly sliced In bowl combine ground chicken green onions ginger root cornstarch soy sauce and hot pepper sauce mix until well blended Lay wonton wrappers on work surface Place spoonful of ground chicken mixture in center of each one Moisten edges with watert Fold each wrapper in half to form triangle press edges to seal Arrange in steamer basket Cover and steam over boiling water for 3 minutes Meanwhile in large saucepan combine broth soy sauce sesame oil and hot chili paste Bring to boil Season to taste with salt and pepper Just before serving stir in spinach until wilted Place 3 wontons in each bowl Ladle soup over top Garnish with green onion rings 31 of36 The Classical Modeling Approach 1 ABSTRN HON Enema View 4171 n fur Cuncepunl um T0430 Alarm hyucal va ALIvrwluzmnmd Almiulm Ewenm cw M w A nfmn The Classical Modeling Approach 2 We start with the EXTERNAL view of actual instances observations or artifacts We then develop a PHYSICAL model that describes the structures and characteristics of the artifacts as they exist From this PHYSICAL model we create a CONCEPTUAL model that abstracts an implementationindependent view to capture the more fundamental structures characteristics or relationships on which the artifacts are based From this CONCEPTUAL model we can create PHYSICAL models that reimplement the presumably more rational robust model 32 of 36 Modeling Always Involves Making Choices There is an essential difference or gap between the domain being modeled and the model So much of the skill of modeling is knowing what to ignore or throw away when you study something So whenever we try to describe anything whether it is an instance or a class of instances that we are treating as equivalent we are making choices about which quotinformationquot to represent Once we make these choices they shape or constrain how we think about what we39ve described 33 of 36 quotInformation Architecturequot 1 An architecture describes a system39s components or quotbuilding blocksquot and their relationships with each other using hierarchical and compositional structure to define the component boundaries This gives an quotarchitectedquot system an explicit model in contrast with systems that are implemented incrementally without a master plan or without the effort to create reusable abstractions and components IA content information structure navigation structure presentation structure presentation design 34 of 36 35 of 36 Information Architecture 2 Navigation Design Graphic Design Readings for INFO 202 Lecture 3 Weinberger Chapters 7 8 amp 9 Svenonius Chapter 2 36 of 36 11 Information Integration and Interoperability INFO 202 6 October 2008 Bob Glushko Plan for INFO Lecture 11 Understanding the quotIntegration Problemquot quotWhen Models Don39t Matchquot Enterprise Data Integration Business Intelligence and quotDashboardsquot Enterprise Information amp Data Management Goals Run the business more ef ciently through greater automation and quotstraight throughquot or quotendtoendquot processing of the information that it creates and receives Consolidation of data from multiple business units to create a unified view of the customer supply chain etc Get endtoend visibility of business processes Take different perspectives from high level aggregation to resolving individual data anomalies or inconsistency Make better decisions more quickly Enterprise and Interenterprise Information amp Data Management Challenges Internal to a rm application quotsilosquot or quotstovepipesquot may have been created overtime and not have been designed to share information with each other Each of these systems has a specific purpose and a data model customized for that purpose so these models may be incomplete or incompatible with respect to each other This causes problems when business processes can span multiple departments business applications or even multiple firms These problems are greater when information or data comes from outside the firm The Integration Requirement Companies have so many internal with employees and external with customers and suppliers interactions that they must automate as many as possible This requires INTEGRATION the controlled and automated sharing of content data and business processes among any services applications or information sources intra or intercompany Integration has long been a substantial portion of the IT activities in many companies Integration Is NOT Interoperation Integration means that one application can extract or obtain information from another one It doesn39t mean that the information will work quotas isquot for the receiving application or service Different systems may use different formats for nominally the same data items September 11 2001 9112001 and 91101 110901 in Europe Furthermore there may be signi cant semantic differences between data items with the same name Syntactic Structural amp Semantic Interoperability Syntactic interoperability isjust the ability to exchange information It requires agreement or compatibility at the transport and application layers of the communications protocol stack and with the messaging protocol and encoding format Structural interoperability means that all of the expected information components are present with the same arrangement and granularity Semantic interoperability requires that the content of the message be understood by the recipient application or process Interoperability isn39t All or None Some interoperability problems can be detected and resolved by completely automated mechanisms Other problems can be detected and resolved with some human intervention Other problems can be detected but not resolved Some problems can go undetected Transformation in Semantic Integration Semantic integration is the process by which a common semantic quotdata modelquot or quotobject modelquot is created through transformation Easy transformations include field length data type or unit conversion 0 Product attributes can be extracted from text descriptions 0 Classification into product categories eg UN SPSC can be performed using some form of quotnearest neighborquot or principle components analysis What39s the most powerful semantic integration processor Who does the transformation the sender or the receiver Catalog Integration We39ve talked about the need to combine information sources with different datadocument models and semantics in many previous lectures Combining quotcatalog informationquot is a challenge ebusinesses Stonebraker amp Hellerstein but also for digital libraries The goal for both is to create catalogs that describe actual and quotvirtualquot resources that they can provide access to without needing to collect or control them directly Technical Challenges Shared by MultiEnterprise Contentl Digital Libraries Information comes from owners with varying relationships with the integrator Information comes in different formats and semantics Information requires multiple schemas and multiple taxonomies Technical Challenges that Contrast MultiEnterprise Contentl Digital Libraries Digital libraries are integrating metadata while ebusiness applications are integrating both metadata and content from hybrid document types Standards for library content and metadata are better established than those for enterprise content Information is syndicated personalized much more for enterprise content Enterprise information is operational and volatile NonTechnical Challenges Who makes decisions about destination formats Whose responsibility is it to create the destination format Who owns the new information When Models Don39t Match Suppose you publish your web service interface description and tell the world quotmy ordering service requires a purchase order that conforms to this schemaquot This says quotsend me MY purchase orderquot not quotsend me YOUR purchase ordeIquot How likely is it that the purchase orders being used by other firms will be able to meet your interface requirement either directly or after being transformed To Interoperate or not to Interoperate Problem Does my quotpurchase orderquot mean the same thing as everyone else s The Target Model For The Interopergm Scenarios ORDER ORDER LINE ADDRESS NAME LINE ITEM BOOK ITEM The XSD Schema for the Expected Order 1 xs schema xmlns xsquotht p wwww3org2001XMLSchanaquot elementFormDefaultquotqualifie quotgt 1 1mm L by 1 uni ltxs complexType name OrderTypequotgt ltxs sequence ltxs element namequotBuyersIDquot typequotxs stringquotgt ltxs element ame 39BuyerPartyquot typequotPartyTypequotgt equot0 derLinequot typequot0rderLineTypequot maxOccursquotunboundedquotgt ltxs sequence ltxs complexTypegt ltxs complexType namequotPartyTypequotgt xssequence xselement namequotIDquot typequotxsstringquot ltxselement namequotPartyNamequot ypequotPartyNameTypequotgt ltxselement namequotAddressquot typequotAddressTypequotgt Fe ltxs complexType namequotPartyNameTypequotgt xssequence xselement namequotNamequot typequotxsstringquot minOccursquot0quotgt ltxs sequence ltxs complexTypegt The XSD Schema for the Expected Order 2 ltxs complexType namequotAddressTypequotgt cegt element namequotRoomquot typequotxs stringquotgt r ltxs element namequotPostal Zonequot typequotxs stringquotgt L quotCountryquot typequotxs stringquotgt u in KS element nam ltxs complexType namequot0rderLineTypequotgt ltxs sequencegt ltxselement namequotLineItemquot typequotLineIten ypequotgt ltxs complexType namequotLineItanTypequotgt ltxs sequencegt ltxselement namequotBookItanquot typequotBookIten ypequotgt e 39BasePricequot typequotxs decimlquotgt ltxs element namequotQuantityquot typequotxs intquotgt ltxs complexType nemequotBookItanTypequotgt ltxs sequencegt ltxselement namequotTitlequot typequotxsstringquotgt ltxselement nem 39Authorquot typequotxsstringquotgt ltxselement namequotISBNquot typequotxsstringquotgt ltxs sequencegt ltAdd The Expected Instance ltOrdergt ltBuyersIDgt91604ltBuyersIDgt lt e lt amegtMaynard James KeenanltNamegt ltPartyNamegt lt res sgt Roomgt50 5lt Roomgt ltBuildingNumbergt11271ltBuildingNumbergt ltSt V reetNamegt entura Blvd lt reetNamegt ltC1tyNamegt tudio Citylt 1tyNamegt lt ostalZonegt91604ltPostalZonegt F ltCountrygtUSAltCountrygt ltAd ress ltBuyerPartygt ltOrderLi ltLineIt BookItemgt ltTitlegtFoucault 39 s PendulunKTitlegt ltAu horgt mber o EcoltAutho ltISBNgt0345368754ltISBNgt Boo Itemgt ltBasePricegt7 99ltBasePricegt ltQuantitygt1ltQuantitygt ltLineItemgt ltOrderLLnegt ltOr er lt Namegt Identical Model with Different Tag Names 1 ltCustome rgt ltNumbergtKEENltNumbergt ltNamegt James ngt Un1tgt505lt reetNumb u itgt ergt11271lt lt StreetNumbergt Streetgt ltState allforn1altStat gt lt ountrygt AltCoun ry ltLoca 39 ngt ltC ustomergt Identical Model with Different Tag Names 2 ltAeheteurgt lt1 DgtKEENltIDgt ltNomgt 1 M A James ltNomgt ltAddressegt ltAppartmentgt505ltAppartmentgt ltBetimentgt11271ltBetimentgt R egtVentura BlvdltRuegt ltV11 egts u 10 CityltVillegt ltCodePostalgt 1604ltCodePostalgt ltEtetgtCelifornieltEtetgt ltPaysgtUSAlt Paysgt ltAddresse ltAeheteurgt Same Model Attributes Instead of Elements ltBuyerPart IDquotKEENquot NemequotMeynerd James Keenan Roomquot505quot BuildingNumberquot11271quot StreetNamequotVentura Blvd quot CityquotStudio Cityquot StatequotCaliforniaquot PostalCodequot91604quot gt Granularity Conflicts ltAddressgt ltStreetAddressgt11271 Ventura Blvd 505ltstreetAddressgt ltCitygtStudio City 91604ltCitygt ltCountrygtUSAltCountrygt ltAddressgt ltPartyNam gt ltFamilyNamegtKeenanltFamilyNamegt ltMidd eNamegtJameslt MiddleNamegt irstNamegtMaynardltFirstNamegt ltPartyNamegt Assembly Mismatch Separate Customer and Order Documents 1 ltBuyerPar y amegt ltNamegtMaynard James KeenanltNamegt ltPartyNamegt ltA dressgt ltRoomgt505lt Roomgt 39 ingNumbergt11271ltBuildingNumbergt vd s reetNamegt C1 tylt Ci tyNamegt ltPostal Ventur 1 ltC1tyNamegtStudio lt ostalZonegt91604 Zonegt ltCountrygtUSAltCountrygt ltAddressgt ltBuyerPartygt Assembly Mismatch Separate Customer and Order Documents 2 ltOrdergt ltBuyersIDgt91604ltBuyersIDgt ltBuyerPart ltIDgtKEENltIDgt ltBuyerPartygt ltOrderLinegt ltLine1temgt ltBookI temgt ltTitlegtFoucault 39 s PendulunKTitlegt ltAuthorgtUmberto EcoltAuthorgt lt1 SBNgt0345368754ltI SBNgt ltBookIt ltBasePricegt7 99ltBasePricegt ltQuantitygt1ltQuantitygt ltLineI temgt ltOrderLinegt ltOrdergt Conceptual Incompatibility ltAddres sgt ltLatitude directionquotNquotgt37 871ltLatitudegt ltLongitude direction quotwquotgt 1 2227 1ltLongitudegt ltAddressgt Lessons from the Interoperability Examples There are a large number of ways that two implementation models that are supposed to be equivalent can fail that test But no matter how different they look with different syntaxes tag names or assembly models if their conceptual model isthe same it is possible to transform one implementation model to another Validation is not suf cient to guarantee complete interoperability The Dimensions of Interoperability Agreement on Conceptual Model High 09 W Agreement on Implementation LOW High Model The Dimensions of Interoperability Agreement on Conceptual Model High You can You re get there there You can t Don t be get there fouled Agreement on Low Implementation Low High MOdel An Enterprise Information Integ ration Scenario An existing customer calls a service representative to increase an order The service representative must locate information about the customer locate the existing order determine ifthe order can be changed or whether a new order must be created 39 hether r 39L A k A 39 paymclu history and credit What information sources or applications must the service rep consult How can it be done Integration quotBy Eyequot I v The need to consult multiple unintegrated applications to locate information to complete a business process Recent study by Corizon 66 of call center agents use three applications or more to serve customers on a typi I call u 27 use ve or more 0 71 cl 39 ime is wasted on or after a call because of switching between different applications rm Enterprise Portals 1 H W Hazy n e Enterprise Portal Applications l W N WWW Portal applications replace the different interfaces to multiple systems with a single userfriendly screen that accesses only the parts of a backend system that the employee needs Purpose is to create a uni ed experience with a quotsingle signonquot You can think of this trying to recreate something like Yahoo for the enterprise Intranet Nearly every major so ware vendor has created an enterprise portal solution that is an attempt to quotupsellquot from the application server platform Enterprise Information Integration Integration quotby eyequot is inadequate in situations with high transaction rates or complex data and it is necessary for the applications to share data without human intervention This requires true semantic unification of the underlying logic and content models which may or may not be presented to the user as a single quotcomposite applicationquot For large firms and increasingly for mediumsized ones a solution is to implement an ERP system that integrates all of the operational data Business Intelligence ERP and other enterprise systems contain the very granular and quotlive39 operational data of the enterprise ERP systems generate historical reports that are useful for longterm decision making but don39t enable ad hoc analysis of operations needed to make tactical decisions 80 you need another set of your enterprise data organized in a data model optimized for asking questions rather than running your business Generic Enterprise Information Integration Architecture Gantz 2004 Ana lyl amp Reporting Central Data Applicalims Store DWH or ODS Processing Data Warehouses A data warehouse is a quotsubjectoriented integrated timevarying nonvolatile collection of data used in organizational decision makingquot Data warehouses extract data 39om ERP systems and other related business so ware applications into a separate repository It is common practice to quotstagequot data prior to merging it into a data warehouse with an quotExtract Transform and Loadquot ETL application Since the information won39t change denormalization to improve query performance is a common ETL process The data model for the warehouse designed to enable ef cient ad hoc data analysis and reporting is sometimes called a quothypercubequot A common term for the analysis done in a warehouse is online analytical processing or OLAP The Virtual Warehouse A virtual warehouse is created quoton demandquot by centralizing and normalizing metadata about the data sources rather than the data itself The data is le in its original location and extracted only when needed which makes more quotreal timequot analysis and quotbusiness intelligencequot Bad UI for Business Intelligence m 39aiouua mm 2 mm m u mm m m m won 754 39 em v m mm mm um J 153 nm 5 mum am 33 mm mm min mm mm mm Driving Your Business hi The best data warehouse design and the most clever OLAP won39t help the business if the analysis can39t be understood by the decision makers quotDashboardsquot combine information integration with information visualization to enhance the usability of business intelligence A dashboard provides hierarchical views appropriate to different management levels and the means to quotdrill downquot to nd details See idashboardscom or demovisualminingcom Dashboard UI for Business Intelligence Z J Emmi mam FEDEEW xvii R dianforleo ectu39 rg vmz s I h m M Brun J Brown and R Lohde quotAdoption ofUBL in Denmark Business cases and experiencesquot Smita Brunnermeier and Sheila Martin quotInteroperability Costs in the US Automotive Supply Chainquot Arnon Rosenthal Len Seligman and Scott Renner quotFrom Semantic Integration to Semantics Management Case Studies and a Way ardquot 22 Text Processing and Boolean Models IN F0 202 12 November 2008 Bob Glushko Plan for Today39s Class Recall and Precision Models for Information Retrieval Text Processing Operations and Challenges The Boolean Model Overview of Remainder of Semester We begin with introductory conceptual and technical foundations for IR The issues and models get progressively more complicated for the next 5 lectures On the Monday after Thanksgiving we have a lecture on multimedia retrieval with lots of demos followed by a lecture on applications of IR and natural language processing Last new material is quotalumni dayquot December 8 when some former students talk about theirjobs which emphasize IO and IR The last class meeting of the semester December 10 is a course review to prepare you for a threehour final exam on December 15 Schematic View of Classical Search hmmnumi need 39 gnaw 5 1 rim quot I 5 Inform lm mum IR Only Approximates quotFinding Out Aboutquot Question Questioner answerer Queemn I I Met Assessment Recall and Precision ALL DOCUMENTS Releva nt Recall and Precision 2 RECALL is the proportion ofthe relevant documents that are retrieved PRECISION is the proportion of the retrieved documents that are relevant Goal High recall and precision Get as much good stuff as possible while getting as little junk as possible High Recall but Low Precision Low Recall but High Precision Retrieved Relevant High Recall and High Precision 4 Relevant Models of Information Retrieval 1 The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance The IR model explains how these problems are solved By specifying the representations of queries and documents in the collection being searched And the information used and the calculations performed that order the retrieved documents by relevance And optionally the model provides mechanisms for using relevance feedback to improve precision and results ordering Models of Information Retrieval 2 BOOLEAN model representations are sets of index terms set theory operations with Boolean algebra calculate relevance as binary VECTOR models representations are vectors with nonbinary weighted index terms linear algebra operations yield continuous measure of relevance Models of Information Retrieval 3 STRUCTURE models combine representations of terms with information about structures within documents ie hierarchical organization and between documents ie hypertext links and other explicit relationships to determine which parts of documents and which documents are most important and relevant PROBABILISTIC models documents are represented by index terms and the key assumption is that the terms are distributed differently in relevant and non relevant documents What is a quotDocumentquot in Information Retrieval A document is any individually retrievable item in the quotpile of textquot that makes up the COLLECTION Sometimes the boundaries that define documents are obvious or conventional web search returns a web page but sometimes they aren39t quotCarving upquot or quotchunkingquot large documents into smaller text passages may be required for some collections or some user interfaces A collection might contain any number of documents web search engines index billions of pages What is a Query A query is the expression of a user s information needs and can take many forms A natural language description of the need An arti cial and restricted language Restrictions on the vocabulary limit the words that can be used in queries 0 Restrictions on syntax limit the ways words can be combined in logical expressions These restrictions mean that queries may be unable to express the information need completely or accurately The user interfaces to the IR system influence the kinds of queries that the user can express or express easily Text Processing Motivation Not all words are equally useful indicators of what a document is about Nouns and noun groups carry more quotaboutnessquot than adjectives adverbs and verbs Very frequent words that occur in all or most documents add NOISE because they cannot discriminate between documents 80 it is worthwhile to preprocess the text of documents to select a smaller set of terms that better represent them these are called the INDEX terms Text Processing Operational Overview l 0 01 G l DECODING extracting the text to be processed from its stored representation FILTERING creating a stream of characters by removing formatting or nonsemantic markup TOKENIZATION segmenting the character stream into linguistic units STOPWORD ELIMINATION remove words that poorly discriminate between documents STEMMING removing affixes and suffixes to allow the retrieval of syntactic and morphological variations of query terms SELECTING INDEX TERMS choosing wordstems or groups of them as indexing elements CREATING AUXILIARY STRUCTURES like a THESAURUS Decoding The sequence of characters in a stored document might be represented in any number of single or muItibyte encoding schemes Determining this encoding can be easy file extensions or metadata but not always Text encoding specs are welldocumented but quotcommercial products can easily live or die by the range of encodings they supportquot Guess That Encoding 1 AS 13 uwp 3ahsq7fPb jiAeui Qub i39EIEE V eEO39mstpgZA c MiscoSr 39 57mm 539 K uZN erD U a k avua1 m o 39 1VIFfTEUVK LHX aU VFIPeTEWa Z BESb xii nua pmumj 5 a39 quotmogtuxz mm nU Dex y u c Ih Intii Esn kv cuwgwa olxcev w 56 fgzvmr gumquot y 6gt390 onase 391 5cvw xm a uioagpz x39 Maser cj Bquot A v VEHVZ FIU V QWKia JJXZLR A maaxmvaAacrk 9 to icsx 52am v v 35 ap m 399u zVkaZWvo zzv to Q s 39 Esux Id Guess That Encoding 2 parquot 1 quot m n mllabpal d ql 360i7201i0widctlparj clisttabtx720aspalphaaspnumfaautols 4adjustrightrin0ljn720itap0 Degree products The university currently offers a relatively xed set of degree 93 products3994 lfthe university adopted Dellrquote s 93build to order3994 strategy how might its product offerings change How might this new strategy affect the miversityuquote s 3993market share3994 or 93pro tab ity3994 compared with its current strategy par pardql it L l r L J 1 WM par Numerous documents are currently involved in the endto Guess That Encoding 3 ltPartygt ltName FirsbquotArnoldquot LastquotSchwarzeneggerquotgt ltAddressgt ltStreetAddressgtGovernor39s Mansion 1525 H StreetltStreetAddressgt ltCitygtSacramentoltCitygt ltStategtCalifornialtStategt Routequot1234 quot ltAddressgt ltPhonegt ltAreaCodegt91 6ltAreaCodegt ltLocalNumbergt3233047ltLocalNumbergt ltPhonegt ltPartygt Filtering Removing surrounding header or format information from the text to be processed What you filter depends on the encoding format or document type You39d probably discard HTML markup before indexing You39d almost certainly save XML tags for indexing You39d probably want to use the rich metadata in email mail headers Sentence Segmentation Many IR and text processing applications require that the documents be broken into their constituent sentences Punctuation marks like l quot can make this easy but not always sometimes you39ll say quotDr Glushko this is too hardquot But abbreviations Dr break the obvious rule and even more complex rules like quotperiodspacecapital lettelquot signals a sentence break still makes a lot of mistakes Tokenization into quotWordlikequot Elements Another problem that seems trivial just use white space right But what about abbreviations Dr is a word hyphens sometimes part of a word but sometimes a result of formatting case do we distinguish Bank from bank Tokenization Challenges 1 Character sequences where the tokens include complex alphanumeric structure or punctuation syntax glushkoischoolberkeleyedu 102653 October 26 1953 55 BC B52 12832226140 My PGP key is 324a3df234ch23e Tokenization Challenges 2 Mr O39Neill thinks that the boys stories about Chile s capital aren t amusing For O39Nm ll whier of Hun following And or aren t is it is the desired tokenization Tokenization Challenges 3 The language that the characters represent needs to be identified during decoding because it influences the order and nature of tokenization n languages that are written righttoleft like Arabic and Hebrew lefttoright text can be interspersed like numbers and dollar amounts ldi i ayl y Lat I32 2a 1002 3mg yigsii Himquot I 47 i lt7 i lt7 STAR 39 Ercr ia achim Cd its independence in 1 tiller 31 war ol39Frcncii occupation 39 In German compound nouns don39t have spaces between the tokens LebensversicherungsgesellschaftsangestelIter quotlife insurance company employeequot Tokenization in quotNonSegmentedquot Languages And these problems in quotsegmented languagesquot that use white space and punctuation to delimit words seem trivial compared to problems tokenizing Oriental languages that are quotnonsegmentedquot These languages have ideographic characters that can appear as onecharacter words but they also can combine to create new words The analogous problem in English would be the word quotTOGETHERquot do we treat it as one word or is three separate words quotTO GET HERquot t i i i i i i i i 5 l 1 4V3 4 H 9 El i i i t i 5 Jiil i gi T 18 X v i El D 9 El Wail iN ib39illjJ iilg jT magma Stop or Noise Words Any word that doesn39t convey meaning by itself can39t help us quot nd out aboutquot anything so it can be discarded during text processing In English these STOP or NOISE words include determiners such as quotthequot and quotanquot o auxiliaries such as quotmightquot quothavequot and quotbequot conjunctions such as quotandquot quotthatquot and quotwhetherquot degree adverbs such as quotveryquot and quottooquot These words are always among the most frequent in a collection but high frequency alone isn39t what makes them bad index terms So stop or noise words are usually not determined by frequency analysis text processors usually employ a list of them as a kind of negative dictionary But Stop Words Should be Kept for Phrase Indexing quotPresident of the United Statesquot is a more precise query than quotPresidentquot AND quotUnited Statesquot quotTo be or not to bequot quotLet it Bequot quotFlights to Londonquot and quotFlights from Londonquot aren39t the same query quotLaser printer toner cartridgequot vs quotLaser printer with toner cartridgequot One Minute Morphology MORPHOLOGY is the part of linguistics concerned with the mechanisms by which natural languages create words and word forms from smaller units These basic building blocks are called MORPHEMES and can express semantic concepts when they are called ROOTS or abstract features like quotpastnessquot or quotpuraquot Every natural language contains about 10000 morphemes and because of how they combine to create words the number of words is an order of magnitude greater lnflection and Derivation INFLECTION is the morphological mechanism that changes the form of a word to handle tense aspect agreement etc It never changes the partof speech grammatical category dog dogs tengo tienes tenemos tienen DERIVATION is the mechanism for creating new words usually of a different partof speech category by adding a BOUND MORPH to a BASE MORPH build ing gt building health y gt healthy Morphological Processing Morphological analysis of a language is often used in information retrieval and other lowlevel text processing applications hyphenation spelling correction because solving problems using root forms and rules is more scaleable and robust than solving them using word lists Natural languages are generative with new words continually being invented Many misspellings of common words are obscure low frequency words so adding them to a misspelling list would make it impossible to check spellings for the latter Stemming STEMMING is morphological processing to remove pre xes and suffixes to leave the root form of words Stemming reduces many related words and word forms to a common canonical form This makes it possible to retrieve documents when they contain the meaning we39re looking for even if the form of the search word doesn39t exactly match what39s in the documents In English inflectional morphology is relatively easy to handle and quotdumbquot stemmers eg iteratively remove suffixes matching longest sequence in rewrite rule perform acceptably Derivational morphology is more dif cult Stemming Mistakes Stemming affects the recallprecision tradeof39f OVERSTEMMING results when stemming is so aggressive that it reduces words that are not morphologically related to the same root Organization organ 0 Policy police Arm army UNDERSTEMMING results when stemming is too timid and some morphologically related words are not reduced to the same root acquire acquiring acquired gt acquir acquisition gt acquis Selecting Index Terms At this stage in text processing the text collection is represented as a set of stems But not all of them will help a searcher find what they39re looking for because they will retrieve too many or too few documents We can select better index terms if we analyze the distribution of words stems in the collection 0 We can eliminate some terms entirely We will treat some terms as more important than others in indicating what a document is about The Index Logical View Anthony Julius The Hamlet Othello Macbeth 1nd Caesar Temp 951 Cleopa I m Anthony 1 1 O O 0 1 Elm us 1 1 0 1 0 O C aesa r 1 1 0 1 1 1 a1 plu39m 0 1 0 I 0 O Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worse 139 1 0 1 l 1 0 An index is a data structure that records information about the occurrences of terms in documents This is a termdocument matrix rows for terms columns for documents one such data structure The quotInvertedquot Index Using a termdocument matrix index representation is both infeasible and nonsensical for any substantial collection of documents 80 instead we divide the index into two parts A DICTIONARY is a list of the terms A POSTINGS LIST is the list of documents in which each term occurs usually with frequency and position information within each document BrulusH124l113145173171 Caesar 2 3 616 Indexing Step 1 Term List Doc 1 Now is the time for all good men to come to the aid of their country Doc 2 It was a dark and stormy night in the country manor The time was past midnight Step 2 Alphabetize and Merge Step 3 Separate Dictionary and Postings Dictionary Postings Boolean Queries The simplest query language to implement is a Boolean one because it has a very direct correspondence to the text processing story and indexing story we just told Boolean queries dominate commercial IR systems and are implemented but rarely used for web searches Boolean queries are expressed as Terms Operators Terms are words or stemmed words Operators are AND OR NOT Boolean Expressions Usually expressed with INFIX operators a AND b OR C AND b NOT is UNARY PREFIX operator a AND b OR C AND NOT b AND and OR can be nary operators a AND b AND c AND d DeMorgan39s Law NOTa AND NOTb NOTa OR b NOTa OR NOTb NOTa AND b Sample Boolean Queries Cat Cat OR Dog Cat AND Dog Cat AND Dog OR Collar Cat AND Dog OR Collar AND Leash Cat OR Dog AND Collar OR Leash Interpreting Boolean Queries Cat OFI Dog AND Collar 0R Leash Boolean Search with Inverted Indexes 1 Permit fast search for individual terms For each term you get a list consisting of Document ID Frequency of term in doc optional Position ofterm in doc optional Boolean Search with Inverted Indexes 2 Dictionary Postings QUERY time AND darkquot 2 documents with time in dictionary have DOC DOC2 in posting file 1 document with dark in dictionary has DOWZ in posting file SOLUTION DDC2 satis es query Readings for Lecture 23 1117 Manning Chapter 7 Vector Space Retrieval
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'