COMPUTERS & LANGUAGE LING 120
These 3 pages of class notes were uploaded by Gretchen Schmidt on Saturday, September 26, 2015. The notes belong to LING 120 at Iowa State University, taught by Staff in Fall.
Date Created: 09/26/15
LING 120 Computers and Language
Topic: Text Categorization
Reading (recommended): Sebastiani 2005

Text categorization (also called text classification, topic spotting, topic identification)

Goal: Given a finite set of categories and a document, assign the document to a category.

Applications:
- News/journalism
- Information security/intelligence, including spam filtering
- Library & information science

Two scenarios:
- Categories known in advance
- Categories unknown in advance

Text categorization uses techniques from natural language processing (NLP), information retrieval (IR), and machine learning (ML).

If categories are unknown in advance, we are interested in finding out what kinds of patterns can be inferred from text.
- Usually some form of clustering of texts is performed, i.e. based on the statistical properties of texts, they are automatically placed in groups.
- This might lead to discovering new patterns that were unknown before (knowledge discovery).

If categories are known in advance, we perform text categorization.

How is it done?
- Rule-based approaches (popular up until the late 1980s)
    - Subject matter experts wrote rules about document properties in each topic.
    - Very labor-intensive & time-consuming; not terribly accurate.
- Statistical approaches (popular since the late 1980s)
    - Assumptions:
        - If a word is unusually frequently used in a document, it's probably important to that document.
        - Documents in the same category use similar vocabulary.
    - Need a way to measure the importance of a word (term) in a document:
        - Simple term frequency isn't useful: all function words (determiners, prepositions, auxiliary verbs, particles & conjunctions) are frequent in all documents.
        - Usually a stop list is used to filter out frequent function words.
        - But still, simple term frequency isn't a good measure of term importance: the longer the document, the higher the term frequencies.
        - Then we have to normalize term frequencies for the length of documents, i.e.

          $$\mathrm{tf}_{ij} = \frac{\text{frequency of term } t_i \text{ in document } d_j}{\text{length of } d_j}$$

        - Still not a good measure. Many terms are just naturally more frequent than others. We have to see how important a term is to $d_j$ relative to other documents.
        - Document frequency $\mathrm{df}_i$ is the number of documents the term $t_i$ occurred in.
        - Inverse document frequency is a measure of term importance in general and is calculated as follows:

          $$\mathrm{idf}_i = \log\left(\frac{\text{total number of documents}}{\mathrm{df}_i}\right)$$

        - If $t_i$ occurs in all documents, $\mathrm{idf}_i$ is 0. The lower $\mathrm{df}_i$, the higher $\mathrm{idf}_i$.
        - Then the importance of $t_i$ in $d_j$ is calculated as $\mathrm{tf}_{ij} \times \mathrm{idf}_i$.
    - Need a way to represent documents:
        - Each document could be represented as a vector of tf-idfs for all the words in the language minus the stop list, e.g. 120 00 46 67
        - Then we can measure the similarity of documents based on this representation.
        - But that means extremely large vectors. Also, not all words in the language are good category predictors; a lot of words are topic-neutral, e.g. come, go, alive, red.
        - We then represent documents as vectors of tf-idfs for only a subset of useful words in the language.
        - Similarities of documents with each other are then calculated using well-known algebraic methods.

How do we train a document classifier?
- Use a corpus of documents with known categories.
- Split up the corpus into three sets: training set, validation set, test set.
- Extract useful terms.
- Represent documents as vectors of tf-idfs.
- Find optimal parameter settings for the classifier using the documents in the training set.
- Validate the parameter settings by classifying the documents in the validation set.
- Repeat the last two or three steps until the classifier can't get better.
- Test the classifier on test data for reporting results.

Evaluating classifier performance
- Precision measures what percentage of the documents classified as belonging to a class actually belonged to that class:

  $$\text{precision} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}}$$

- Recall measures what percentage of the documents belonging to a class were actually found by the classifier:

  $$\text{recall} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}$$

- F-measure combines precision and recall:

  $$\text{f-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

Reference
Sebastiani, F. (2005). Text Categorization. In Alessandro Zanasi, editor, Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109-129. WIT Press, Southampton, UK.
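
The tf-idf weighting and document comparison described in the notes can be sketched in a few lines of Python. This is only an illustration under several assumptions that the notes do not specify: the tiny `STOP_LIST`, whitespace tokenization, the natural logarithm for idf, measuring document length after stop words are removed, and cosine similarity as the "well-known algebraic method". The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy stop list; a real one would be much longer.
STOP_LIST = {"the", "a", "of", "in", "and", "is", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop-list words (assumed scheme)."""
    return [w for w in text.lower().split() if w not in STOP_LIST]


def tf_idf_vectors(docs):
    """Return (vocabulary, list of tf-idf vectors), one vector per document."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({w for tokens in token_lists for w in tokens})
    n_docs = len(docs)
    # df_i: number of documents in which term t_i occurs
    df = {w: sum(1 for tokens in token_lists if w in tokens) for w in vocab}
    # idf_i = log(total number of documents / df_i); natural log is an assumption
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        # tf_ij = frequency of t_i in d_j / length of d_j, then times idf_i
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sample documents: two about sports, one about finance.
docs = [
    "the striker scored a goal in the match",
    "the goalkeeper saved a goal in the match",
    "the bank raised interest rates",
]
vocab, vectors = tf_idf_vectors(docs)
sim_sports = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
```

With these sample documents, the two sports texts share the terms "goal" and "match" and so come out more similar to each other than either is to the finance text, which shares no content words with them.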
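
The three-way corpus split from the training procedure might look like the following sketch. The 60/20/20 proportions, the fixed random seed, and the name `split_corpus` are assumptions chosen for illustration; the notes only say that the corpus is divided into training, validation, and test sets.

```python
import random


def split_corpus(labeled_docs, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle (document, category) pairs and cut them into three sets.

    The fractions are assumed values; whatever remains after the
    training and validation sets becomes the test set.
    """
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # seeded so the split is repeatable
    n_train = int(len(docs) * train_frac)
    n_valid = int(len(docs) * valid_frac)
    train = docs[:n_train]
    valid = docs[n_train:n_train + n_valid]
    test = docs[n_train + n_valid:]
    return train, valid, test


# Made-up corpus of ten labeled documents.
corpus = [(f"document {i}", "sports" if i % 2 else "finance") for i in range(10)]
train, valid, test = split_corpus(corpus)
```

Keeping the test set untouched until the very end is what makes the reported results a fair estimate: parameters are tuned only against the training and validation sets.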
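
As a worked example of the evaluation measures, the sketch below counts true positives, false positives, and false negatives for one class and applies the three formulas. The function name, the sample labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f(true_labels, predicted_labels, target_class):
    """Compute precision, recall, and f-measure for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted as the class and actually in the class
    tp = sum(1 for t, p in pairs if p == target_class and t == target_class)
    # False positives: predicted as the class but actually in another class
    fp = sum(1 for t, p in pairs if p == target_class and t != target_class)
    # False negatives: in the class but predicted as something else
    fn = sum(1 for t, p in pairs if p != target_class and t == target_class)
    # Returning 0.0 on an empty denominator is an assumed convention
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Made-up labels for six documents in a spam-filtering setting.
true_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "ham", "spam", "ham"]
precision, recall, f_measure = precision_recall_f(true_labels, predicted, "spam")
```

Here the classifier labels three documents as spam, two of them correctly, and misses one of the three real spam documents, so precision and recall are both 2/3 and the f-measure is 2/3 as well.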