by: Gretchen Schmidt




These 5 pages of class notes were uploaded by Gretchen Schmidt on Saturday, September 26, 2015. The notes belong to LING 120 at Iowa State University, taught by Staff in Fall. Since upload, they have received 41 views. For similar materials, see /class/214431/ling-120-iowa-state-university in Linguistics at Iowa State University.





Date Created: 09/26/15
LING 120 Computers and Language
Topic: Authorship Attribution
Reading: Farringdon

Authorship Attribution
  o Ransom letters
  o Hate mail
  o Any other document
Similarity Analysis, Stylometry
  o Detecting plagiarism
  o Authorship profiling
  o Determining author's cohort

Authorship attribution or "fingerprinting": matching a sample of writing to an attested corpus of an author's work
  o Try to find the author's unconscious signature
  o Can't use stylistic dimensions
  o Need statistical tests for significance

Background
  o Authorship attribution goes back to the 1800s
  o First computer-based study: Mosteller & Wallace on The Federalist Papers (1964)
  o Much controversy on statistical methods, choice of features, meaning of results, and everything else
  o Forensic applications
    ▪ High-profile results by Donald Foster (Vassar College):
        attribution of A Funerall Elegye to Shakespeare (1989)
        matched Primary Colors to Joe Klein (1996)
        matched the Unabomber's manifesto to other writing by Theodore Kaczynski (1996)
    ▪ Big mess re the ransom note in the JonBenet Ramsey case (1997-99)
    ▪ Lawsuit re linking Steven Hatfill to the 2001 anthrax scare

Author's signature or fingerprint
  o Average sentence length, type-token ratio
  o Proportional pairs, e.g., since vs. because
  o Distinctiveness ratios, e.g., use of some unusual words or common words like although
  o Word-length frequency distribution, e.g., two-/three-letter words
  o Distribution of initial-vowel words
  o Some combinations of these, e.g., two-/three-letter words + initial-vowel words

The Federalist Papers (http://en.wikipedia.org/wiki/Federalist_Papers), Mosteller & Wallace 1964
  o 85 articles: 73 of known authorship, 12 of disputed authorship
  o Basic method: distinctiveness ratios of 30 word types:
    according, also, although, always, an, apt, both, by, commonly, consequently, considerable(ly), direction, enough, innovations, kind, language, matters, of, on, particularly, probability, there, this, though, to, upon, vigorous, while, whilst, works
  o Measure how likely it is for each author to use the above words as frequently as they occur in a paper, and compare the likelihood ratios
  o Method gives the correct answer for all attested papers. Greatly favors Madison as author of all disputed papers
  o Other techniques, e.g., cusum techniques
    http://members.aol.com/qsums/QsumIntroduction.html

Authorship Attribution vs. TC (text categorization)
  o Text categorization
    ▪ Focuses on content words used to detect topic
    ▪ Surface words reveal topic to a great extent
  o Stylometry: genre identification, gender identification, authorship attribution & authorship verification
    ▪ Focus on the form of the text; abstract away from content
    ▪ Finding reliable features using shallow techniques is very hard

Authorship Attribution as TC
  o Components of a text categorization system
    ▪ Document representation
    ▪ Selecting suitable features
    ▪ Representing each text as a vector of frequencies
    ▪ Dimensionality reduction
    ▪ Eliminating irrelevant features
  o Learning method
    ▪ Constructing a model for each category
    ▪ Testing protocol
  o Previously used features
    ▪ Frequencies of function words ("function word" defined loosely)
    ▪ Word length and sentence length
    ▪ Word tags and POS sequences

References
Farringdon, M. How to be a Literary Detective: Authorship Attribution. A brief introduction to cusum analysis. Available online at http://members.aol.com/qsums/QsumIntroduction.html
Mosteller, F. and D.L. Wallace (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, Mass.

LING 120 Computers and Language
Topic: Information Retrieval
Recommended Reading: Tzoukermann et al. 2003

Information retrieval: the science of finding objects (text, picture, video, etc.) in any media relevant to a user query
Text retrieval: the science of finding text relevant to a user query in a collection of documents
  o Query may be formulated by the user or may be preformulated based on a user profile
Dates back to 1890, when Herman Hollerith invented a machine to tabulate US census data
Most uses for years were in scientific, legal, and medical fields
Widespread use with the advent of the World Wide Web

Main components of any IR system
  o Document processing: represent documents in a searchable form
  o Query processing: represent queries in a searchable form
  o Matching and retrieval: mechanism to measure document relevance to query

Document processing
  o Documents are usually represented as indexed keywords (inverted index)
      build > 12
      school > 85
    The numbers show the position of the word in a document, in characters. An inverted index is built for each document.
  o To save time/space, stop words (i.e., function words) are removed first
  o Also, keywords are sometimes stemmed, i.e., prefixes and suffixes are removed before indexing. There is still debate on stemming due to mixed results. Stemming typically increases recall but reduces precision (see notes on text categorization).
    ▪ Stemming could be done using
        a traditional stemmer (see Chapter 3 of textbook, exercise 334 number 3)
        or a full-fledged morphological analyzer

Query processing
  o A number of models are generally used, e.g., boolean, vector space & probabilistic
  o Boolean systems: Queries are built based on the presence or absence of query terms in documents. These systems are very efficient and are widely used in bibliographic and other database searches. Decisions are binary (relevant or not relevant), so they may miss documents that are somewhat relevant.
  o Vector space models: Documents are represented as vectors of TF-IDFs
    ▪ Documents are then ranked based on their similarity to the query
    ▪ Similarity is measured based on the geometric positions of the query and documents in the vector space
    ▪ These models are simple, fast, and popular. Mostly used in Internet search engines
  o Probabilistic models: Calculate the probability of a document being relevant given the query. The problem is that the relevance of documents is not known at the time the query is formed. Also, this approach doesn't take into account the frequencies of words in documents. Having been devised in the mid-1970s, this approach is not popular these days.

Latent Semantic Indexing (LSI): Based on Singular Value Decomposition from linear algebra, this technique allows for the retrieval of potentially relevant documents that contain words semantically related to the query terms, not necessarily those words per se. That is, LSI represents documents in a concept space rather than a term space.
  o Increases recall substantially but reduces precision

Approaches to improving retrieval
  o Query expansion: Add words from the most relevant documents to the query (e.g., the "more like this" feature of Google) or add synonyms (requires disambiguation)
  o Use popularity measures: Give higher points to documents which are pointed to frequently or which are viewed by users frequently
  o Use user profiles: Give higher points to documents that are similar to what the user has seen/searched for before
  o Use Linguistically Motivated Indexing (LMI): Index based on linguistic phrases, e.g., noun phrases or collocations

References
Tzoukermann, E., Klavans, J., & Strzalkowski, T. (2003). Information Retrieval. In Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, pp. 529-544. Oxford: Oxford University Press.
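
The document-processing, boolean, and vector-space steps in the Information Retrieval notes above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the method of any particular search engine: the stop-word list, the whitespace tokenizer, the weighting tf * log(N / df), and all function names (tokenize, build_inverted_index, boolean_and, tfidf_vector, cosine, rank) are hypothetical choices made for the example. Unlike the positional index in the notes, this index records only which documents contain a term, and stemming is left out.

```python
import math
from collections import Counter

# Assumed toy stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to", "was"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, query):
    """Boolean retrieval: ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def tfidf_vector(tokens, index, n_docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    vector = {}
    for term, tf in Counter(tokens).items():
        df = len(index.get(term, ()))
        if df:  # terms unseen in the collection get no weight
            vector[term] = tf * math.log(n_docs / df)
    return vector


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (doc_id, score) pairs, most similar to the query first."""
    index = build_inverted_index(docs)
    n_docs = len(docs)
    query_vec = tfidf_vector(tokenize(query), index, n_docs)
    scores = [
        (doc_id, cosine(query_vec, tfidf_vector(tokenize(text), index, n_docs)))
        for doc_id, text in docs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up three-document collection for illustration only.
docs = {
    "d1": "The school will build a new library.",
    "d2": "Students build robots in the school lab.",
    "d3": "The census data was tabulated by machine.",
}
index = build_inverted_index(docs)
print(boolean_and(index, "build school"))  # binary decision: match or no match
print(rank(docs, "census machine"))        # graded ranking by cosine similarity
```

The contrast between the two calls mirrors the notes: the boolean query returns an unranked set of matching documents, while the vector-space query assigns every document a score, so partially relevant documents can still appear lower in the ranking.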

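For the Authorship Attribution topic in the notes above, a few of the listed "signature" features (average sentence length, type-token ratio, marker-word rates) and a likelihood-ratio comparison can be sketched as follows. This is a simplified stand-in, not Mosteller & Wallace's actual statistical model: it assumes a multinomial distribution over a handful of marker words with add-one smoothing, and the marker subset, the regular-expression tokenizer, the toy texts, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed subset of the 30 marker words listed in the notes.
MARKERS = ["upon", "while", "whilst", "although", "enough"]


def words(text):
    """Lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def avg_sentence_length(text):
    """Mean number of words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words(text)) / len(sentences) if sentences else 0.0


def type_token_ratio(text):
    """Distinct word types divided by total word tokens."""
    tokens = words(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def marker_rates(text, markers=MARKERS):
    """Occurrences of each marker word per 1000 words of text."""
    tokens = words(text)
    counts = Counter(tokens)
    return {m: 1000 * counts[m] / len(tokens) if tokens else 0.0 for m in markers}


def log_likelihood(sample, corpus, markers=MARKERS):
    """Log-probability of the sample's marker counts under a multinomial
    model estimated from the author's corpus with add-one smoothing."""
    corpus_counts = Counter(words(corpus))
    total = sum(corpus_counts[m] for m in markers) + len(markers)
    sample_counts = Counter(words(sample))
    return sum(
        sample_counts[m] * math.log((corpus_counts[m] + 1) / total)
        for m in markers
    )


def log_likelihood_ratio(sample, corpus_a, corpus_b):
    """Positive values favor author A, negative values favor author B."""
    return log_likelihood(sample, corpus_a) - log_likelihood(sample, corpus_b)


# Made-up texts standing in for attested corpora and a disputed sample.
corpus_a = "It depends upon the people. Upon reflection, the plan rests upon consent."
corpus_b = "Whilst the states agree, the union holds. Whilst they differ, it fails, although slowly."
sample = "The power rests upon the states and upon the people."

print(avg_sentence_length(sample), type_token_ratio(sample))
print(log_likelihood_ratio(sample, corpus_a, corpus_b))
```

Real studies would use far larger corpora and the significance tests the notes call for; with texts this short the ratio only shows the direction of the comparison.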
