by: Gretchen Schmidt



These 3-page class notes were uploaded by Gretchen Schmidt on Saturday, September 26, 2015. They belong to LING 120 (Linguistics) at Iowa State University, taught by Staff in Fall.





Date Created: 09/26/15
LING 120: Computers and Language
Topic: Text Categorization
Reading (recommended): Sebastiani 2005

Text categorization (also called text classification, topic spotting, or topic identification)

Goal: Given a finite set of categories and a document, assign the document to a category.

Applications
  - News/journalism
  - Information security/intelligence, including spam filtering
  - Library & information science

Two scenarios
  - Categories known in advance
  - Categories unknown in advance

Text categorization uses techniques from natural language processing (NLP), information retrieval (IR), and machine learning (ML).

If the categories are unknown in advance, we are interested in finding out what kinds of patterns can be inferred from text.
  - Usually some form of clustering of texts is performed, i.e., based on the statistical properties of the texts, they are automatically placed in groups.
  - This might lead to discovering new patterns that were unknown before (knowledge discovery).

If the categories are known in advance, we perform text categorization.

How is it done?
  - Rule-based approaches (popular up until the late 1980s)
      - Subject matter experts wrote rules about document properties in each topic.
      - Very labor-intensive & time-consuming; not terribly accurate.
  - Statistical approaches (popular since the late 1980s)
      - Assumptions:
          - If a word is used unusually frequently in a document, it's probably important to that document.
          - Documents in the same category use similar vocabulary.
      - Need a way to measure the importance of a word (term) in a document
          - Simple term frequency isn't useful: all function words (determiners, prepositions, auxiliary verbs, particles & conjunctions) are frequent in all documents.
          - Usually a stop list is used to filter out frequent function words.
          - But even then, simple term frequency isn't a good measure of term importance: the longer the document, the higher the term frequencies.
          - So we have to normalize term frequencies for the length of documents, i.e.

            $tf_{ij} = \frac{\text{frequency of term } t_i \text{ in document } d_j}{\text{length of } d_j}$

          - Still not a good measure. Many terms are just naturally more frequent than others. We have to see how important a term is to $d_j$ relative to other documents.
          - Document frequency $df_i$ is the number of documents the term $t_i$ occurred in.
          - Inverse document frequency is a measure of term importance in general and is calculated as follows:

            $idf_i = \log\left(\frac{\text{total number of documents}}{df_i}\right)$

          - If $t_i$ occurs in all documents, $idf_i$ is 0. The lower $df_i$, the higher $idf_i$.
          - The importance of $t_i$ in $d_j$ is then calculated as $tf_{ij} \times idf_i$.
      - Need a way to represent documents
          - Each document could be represented as a vector of tf-idfs for all the words in the language (minus the stop list), i.e., a list of numbers with one tf-idf weight per word.
          - Then we can measure the similarity of documents based on this representation.
          - But that means extremely large vectors. Also, not all words in the language are good category predictors: a lot of words are topic-neutral (e.g., come, go, alive, red).
          - We therefore represent documents as vectors of tf-idfs for only a subset of useful words in the language.
          - Similarities of documents with each other are then calculated using well-known algebraic methods.

How do we train a document classifier?
  1. Use a corpus of documents with known categories.
  2. Split up the corpus into three sets: training set, validation set, test set.
  3. Extract useful terms.
  4. Represent documents as vectors of tf-idfs.
  5. Find optimal parameter settings for the classifier using the documents in the training set.
  6. Validate the parameter settings by classifying the documents in the validation set.
  7. Repeat the last two or three steps until the classifier can't get better.
  8. Test the classifier on the test data for reporting results.

Evaluating classifier performance
  - Precision measures what percentage of the documents classified as belonging to a class actually belonged to that class:

    $\text{precision} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}}$

  - Recall measures what percentage of the documents belonging to a class were actually found by the classifier:

    $\text{recall} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}$

  - F-measure combines precision and recall:

    $\text{f-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

Reference
Sebastiani, F. (2005). Text Categorization. In Alessandro Zanasi, editor, Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109-129. WIT Press, Southampton, UK.
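
The tf-idf weighting, the vector representation, and the evaluation measures described in the notes can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not in the notes: the function names and the tiny stop list are hypothetical, tokenization is a plain whitespace split, document length is counted after stop-list filtering, the logarithm is the natural log (the notes do not give a base), and cosine similarity stands in for the unspecified "well-known algebraic methods".

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop list; real stop lists are much longer.
STOP_LIST = {"the", "a", "of", "in", "and", "is", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop-list words."""
    return [w for w in text.lower().split() if w not in STOP_LIST]

def tf_idf_vectors(docs):
    """Return (vocabulary, vectors), one tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # df: number of documents each term occurs in.
    df = Counter(w for toks in tokenized for w in set(toks))
    # idf = log(total number of documents / df); 0 if the term is in every document.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # avoid division by zero for empty documents
        # tf = term count / document length, then weight by idf.
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_recall_f(tp, fp, fn):
    """Precision, recall, and f-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Made-up example documents: two about sports, one about politics.
docs = [
    "the goalkeeper saved the penalty",
    "the striker scored a penalty goal",
    "the senate passed the budget bill",
]
vocab, vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # the two sports documents share "penalty"
print(cosine_similarity(vectors[0], vectors[2]))  # no shared vocabulary
print(precision_recall_f(tp=8, fp=2, fn=4))
```

In this toy corpus the two sports documents get a positive similarity because they share a term, while the sports and politics documents share none. A real system would use the reduced vocabulary of useful terms mentioned in the notes rather than every non-stop word.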

