Concepts in Computing with Data STAT 133
These 13 pages of class notes were uploaded by Floy Kub on Thursday, October 22, 2015. They belong to STAT 133 at University of California - Berkeley, taught by Staff in Fall.
Date Created: 10/22/15
File Management Commands

o mkdir directory-name: create a new directory.
o touch file-name: create a new file with no content, or "touch" an existing file.
o cd directory-name: change directory to the one mentioned. The directory name can be relative to where you are, or absolute.

Suppose you are in the directory /home/nolan/stat133 and it contains a directory notes. The next three commands all change the directory to /home/nolan/stat133/notes.

    cd notes
        The directory name is relative to the current directory.
    cd /home/nolan/stat133/notes
        Here we have supplied the full pathname to find the directory.
    cd ~/stat133/notes
        We use the ~ symbol to refer to the user's home directory, which in this example is /home/nolan.
    cd ..
        This relative change is to the directory above the current directory, i.e. to /home/nolan/stat133.
    cd ../stat205
        This relative change is to the directory at the same level of the tree as the current directory, i.e. /home/nolan/stat205.

File Management Commands (Ctd.)

o rmdir directory-name: remove a directory; it must be empty to be removed.
o pwd: gives you the name of the current working directory.
o cp file1 file2: copy the first file to the second file.
o mv file1 file2: rename, i.e. move the first file to the second file name.
o rm filename: remove the file.
o ls: list the files and directories in the current directory.
o more filename: shows the contents of a file.
o tail filename: shows the last lines of a file.

Example

Let's create a system of files and directories that matches the handout from Friday.

    > mkdir Test
    > cd Test
    Test> touch file1.tex
    Test> mkdir A
    Test> mkdir A/B
    Test> cd A
    Test/A> touch file2.tex
    Test/A> touch B/file4.doc
    Test/A> touch B/file5.txt
    Test/A> mkdir B/D
    Test/A> touch B/D/x
    Test/A> touch B/D/xz

ls: list directory contents

From within the Test directory that we set up, we see the directory A and the two files x and file1.tex.

    > ls
    A  file1.tex  x

The -R option says: recurse your way through the tree of subdirectories, executing the ls command as you go.

    > ls -R
    A  file1.tex  x

    A:
    B  C  file2.tex  file3.csv

    A/B:
    D  file4.doc  file5.txt

    A/B/D:
    x  xz

    A/C:
    x

Help for commands

o man ls: the online manual pages for the ls command.
o ls --help: abbreviated help on the various options to the ls command. The beginning of its output, as cut off at the edge of the slide:

    Usage: ls [OPTION]... [FILE]...
    List information about the FILEs (the current directory by default).
    Sort entries alphabetically if none of -cftuSUX nor --sort.

    Mandatory arguments to long options are mandatory for short options too.
      -a, --all                  do not hide entries starting with .
      -A, --almost-all           do not list implied . and ..
          --author               print the author of each file
      -b, --escape               print octal escapes for nongraphic characters
          --block-size=SIZE      use SIZE-byte blocks
      -B, --ignore-backups       do not list implied entries ending with ~
      -c                         with -lt: sort by, and show, ctime (time of last
      -C                         list entries by columns
          --color[=WHEN]         control whether color is used to distinguish file
      -d, --directory            list directory entries instead of contents
      -D, --dired                generate output designed for Emacs' dired mode
      -f                         do not sort, enable -aU, disable -lst
      -F, --classify             append indicator (one of */=@|) to entries

The * wild card

The symbol * matches any number of characters except the /. It can be quite handy when looking for files that have a particular type. Below we list only those files with the filetype extension tex in the subdirectory A.

    > ls A/*.tex
    A/report.tex
    > ls -R *.tex
    ls: No match

Can you explain why the second command does not find any files with the filetype tex?

Piping

We can construct more complex shell commands by piping, or sending, the output from one command to the input of another. Below we pipe the output from the ls command to the wc command.

    > ls -R | wc
         33      29     140

wc, short for word count, returns the number of newlines, words, and bytes in its input. Does this mean that there are 33 files in the Example directory?

find: search for files in a directory hierarchy

The find command may be better suited to the previous task.

    > find . -name "*.tex"
    ./A/report.tex

We can also do fancier finds, such as all tex files in my home area that have been modified within the past 21 days.

    > find /home/nolan -mtime -21 -name "*.tex" | wc -l
    2421

Or we can find all those files that do not end with tex in the working directory.

    > find . -type f -not -name "*.tex"

grep: print lines matching a pattern

With grep we can search for patterns within a file. The syntax is

    grep [options] PATTERN [FILE...]

Suppose we want to find all files that have the words "data frame" in them.

    > grep -lR "data frame" *
    DataTypes/DataTypes.tex
    DataTypes/Assignment.tex
    DataTypes/Lists.tex
    Introduction/introduction.tex
    ShellCmds/ShellCmds.tex.swp
    Traffic/traffic.tex
    schedule

grep (continued)

To invert the match we can use the -v option to the grep command.

    > grep -vlR "data frame" * | more
    DataTypes/pdfFiles.tex
    DataTypes/ExampleLatex2.log
    DataTypes/ExampleLatex2.aux
    DataTypes/ExampleLatex2.pdf
    --More--

Here the more command formats the output being sent to the console so that you see one page at a time, rather than have the whole output go whizzing past. To proceed to the next page of output, hit the space bar. To stop displaying the output, hit q.

Other useful aspects of the Shell commands

o > and >> (redirection): At times we want to save the output from a command to a file. We can do this by redirecting the output to a file. We use the single > to direct the output to a new file (it will also overwrite an existing file), and we use the double >> if we want to add the output to the end of an existing file.
o chmod: change permissions on files and directories. In Unix we can define groups of users, such as all student accounts for Stat133, or all faculty accounts, or all graduate student accounts. The chmod command allows us to set specific permissions for the owner of the file, the group that the owner belongs to, and for all other accounts. The -l option on the ls command shows the permissions.

    Test> ls -l
    total 4
    drwxr-xr-x  4 nolan nolan 4096 Sep 19 11:54 A
    -rw-r--r--  1 nolan nolan    0 Sep 19 11:53 file1.tex
    -rw-r--r--  1 nolan nolan    0 Sep 19 11:53 x
    Test> chmod g+w *
    Test> ls -l
    total 4
    drwxrwxr-x  4 nolan nolan 4096 Sep 19 11:54 A
    -rw-rw-r--  1 nolan nolan    0 Sep 19 11:53 file1.tex
    -rw-rw-r--  1 nolan nolan    0 Sep 19 11:53 x

o sort: sort lines of text files.
o uniq: remove duplicate lines from a sorted file.
o cat: concatenate files and print on the standard output.
o tail: output the last part of files.

Process controls

o ps: report a snapshot of the current processes.
o kill: send a signal to a process.

More advanced shell techniques

o sed: a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
o It is possible to write shell scripts, i.e. programs with loops and variables, and execute these at the command line.


Advanced Topics in R Programming
Batch Jobs, Garbage Collection, Memory Management, Debugging

Duncan Temple Lang
duncan@wald.ucdavis.edu

Topics

In the 2 lectures I will present, we'll try to cover:

o General questions (R, ad hoc networks, programming, etc.)
o Batch & background jobs
o Garbage collection
o Managing memory
o Debugging
o Recursive functions

Notes at http://eeyore.ucdavis.edu/stat133

Batch Jobs

Usually we run R commands interactively. But if they take a long time, you want to leave them and come back when they are finished. You can lock the screen: BAD! Instead, use a batch or background job using the shell. This is an important part of Scientific Computing.

Batch Jobs

Put the R commands into a file, say code.R. Run R reading commands from that file, and put the output into another file:

    R --no-save < code.R >& output.Rout

--no-save just tells R not to bother saving the workspace when it finishes. Other possible options are --vanilla, --save, --no-environ, etc. See the documentation for the R shell command (?Startup).

    R --no-save < myCode.R >& output.Rout

What does < mean? The shell redirects input to R using the contents of the file myCode.R. This is very similar to typing the lines one at a time at the R prompt. It is not quite the same as source("myCode.R"), but close.

    R --vanilla < myCode.R >& output.Rout

The >& means redirect both output and errors to the file output.Rout. If we just had > output.Rout, the errors would go to the console/terminal. The >& is specific to the C shell (csh, tcsh). For the Bourne shell (bash, sh), use

    R --no-save < myCode.R > output.Rout 2>&1

The order matters here: the output must be sent to the file first, and only then can 2>&1 send the errors to the same place.

Background Jobs

We still had to wait for the R --no-save < myCode.R command to finish before we could start a new command in that terminal. We want to get a new prompt so we can do other things, including logging out.

    nohup nice -18 R --no-save < myCode.R >& output.Rout &

o The second & tells the shell to put this process in the background and return a new prompt. It has no connection to the >&.
o nice says: schedule my job when others aren't using the computer. 18 is close to the maximum amount of "niceness" (typically 19).
o If we log out, the process will terminate. To prevent this, prefix the command with nohup (no hangup). On many machines this is not needed, but it never hurts and guarantees the job keeps running when you log out.

Things to Remember

You can log out and return later to see if the job is finished.

First, remember which machine you used! Often people check on the wrong machine.

Has the task finished? If you arrange for your code to generate output at different points, you can look at the output file and look for those markers, e.g. print something at the end of each iteration of a loop.

To look at the file:

    cat output.Rout

or

    tail -f output.Rout

General Job Monitoring

Each job or process has a unique identifier, a number.

    kestrel> /app/bin/R --no-save < long.R >& out &
    [1] 19766

The 19766 is the process identifier.

Use the commands top and ps to find the status of the machine and the job. Use kill to force a job to finish:

    kill -9 19766

More to remember

When creating plots, explicitly open graphics devices and close them.

    pdf("myPlot.pdf")
    hist(x)
    dev.off()

This avoids them all going to one big file on different pages.

And more

If your job stops unexpectedly, you will have to start again from the beginning. It is sometimes useful to save results as you go along, i.e. at different stages/parts of the script.

    save(a, b, c, file = "myFile.rda")

Then you can come back and reload them and continue on from that point, or do additional computations.

Debugging

If you get an error in your script, the job will stop and there will be a message in output.Rout. Hopefully the message will make it clear how to fix the problem.

Often we need to examine the state of the session to figure out why things failed. So we need to be able to interactively explore the values of the different variables.

Post-mortem Debugging

First of all, test code on smaller datasets! But if it does happen in a batch job, we don't have interactive access, so we can't use options(error = recover).

Do post-mortem debugging; see ?debugger. At the start of the script myCode.R put

    options(error = quote(dump.frames(to.file = TRUE)))

Then, after the error, you can explore in a new R session:

    load("last.dump.rda")
    debugger(last.dump)

Debugging

This debugger is basically the same as the one used in interactive use, e.g. with options(error = recover). Jump to different calls, find out what variables are available, print values, do computations.

Debugging is an art. Get experience. Think about probable causes and then try to construct experiments to verify that each is the reason.

What is Garbage Collection?

Notice that in R, when you create data, you don't have to explicitly declare or allocate it. And you don't have to release it. E.g.

    x = 2*x + 10 + rnorm(length(x))

The rnorms are created, added to the other components, and then discarded. The same goes for the original x.

Garbage collection is the process of reclaiming the memory that is associated with objects and computations that are no longer being used (garbage).

When R needs memory to do a computation, it asks its memory manager for space. The memory manager has already allocated a lot of space that it doles out, and so it can provide space for such requests.

If the memory manager doesn't have enough space for the request, then it tries to clean up (garbage collect). It runs through all the spaces that it has given out in earlier requests and reclaims each one if it is no longer being used.

If the memory manager still needs more space, it can grow its pool.

Preallocate Space for the Result

Reiterating what Deb covered last time. Consider the following code:

    ans = numeric()
    for(i in 1:n)
        ans = cbind(ans, foo(i))

In each step, we combine the new result with the previous ones via cbind().

Consider the last iteration, i.e. i = n. The result from the previous iteration is a matrix with n - 1 columns. We then create a new result with n columns. So before we assign the new result to ans, we have approximately 2 copies of the results. And we have to copy all the data from the original to the new result.

This is bad news. Some computations will not be feasible.

Alternative

We know the result is a matrix of size m x n, so allocate it first and then assign each iteration's result into the corresponding column.

    ans = matrix(NA, m, n)
    for(i in 1:n)
        ans[, i] = foo(i)

This does the allocation for the result just once and doesn't create new objects; it just modifies the existing one. The key thing is that ans[, i] = ... doesn't create a new copy of ans, but writes the values into the appropriate subset.

Time Comparisons

    system.time({ans = numeric(); for(i in 1:10000) ans = cbind(ans, rnorm(10))})
    [1] 14.57  4.72 19.62  0.00  0.00

    system.time({ans = matrix(NA, 10, 10000); for(i in 1:10000) ans[, i] = rnorm(10)})
    [1] 0.32 0.01 0.34 0.00 0.00

Of course, we need to have multiple measurements. And the characteristics of the machine, etc. matter, but we can still compare the two meaningfully.

We could use apply to make this read more easily and be more efficient:

    sapply(1:n, function(i) rnorm(10))

The apply functions allocate the result space for us. Note that we can define an anonymous function in the call to sapply(): functions are first class objects in R.

But when we can't use an apply function, making space and writing into that existing space is much faster than growing the result at each step.

Why do we need to know this?

Because when you run simulations, as for your current project, you may run into memory problems. It then helps to be able to reason about them.

It is good to be able to determine approximately how much memory you will need in a computation. Then you can determine if it is feasible or not. And it can also allow you to specify hints to R for how much space it will need and can reserve.

Before we discuss how to control garbage collection, let's just see when it happens. gcinfo(TRUE) tells R to print something on the screen when it performs the garbage collection.

    > gcinfo(TRUE)

Then allocate a big object several times:

    > for(i in 1:10) m = rnorm(100000)

    Garbage collection 5 = 4+0+1 (level 0) ...
    88686 cons cells free (25%)
    2.8 Mbytes of heap free (47%)
    Garbage collection 6 = 5+0+1 (level 0) ...
    88672 cons cells free (25%)
    2.1 Mbytes of heap free (34%)
    Garbage collection 7 = 6+0+1 (level 0) ...
    88658 cons cells free (25%)
    1.3 Mbytes of heap free (21%)

So there were 3 calls to the garbage collector, made when R needed more space and was done with the earlier versions of m.

Size of an object

Find out how big an object is with back-of-the-envelope calculations, or with object.size(x). E.g.

    object.size(matrix(pi, 100, 100))
    [1] 80120

100 x 100 = 10000 elements; each element takes 8 bytes for a number; the extra bytes (120) are associated with R information such as dimensions, class, etc.

    object.size(letters)
    [1] 1068

    > p = cbind(runif(75), runif(75))
    > D = as.matrix(dist(p))
    > object.size(D)
    [1] 51276

This is the number of bytes the distance matrix occupies.

object.size() can be used to approximately determine how much space you might need for a calculation. Things are more complex than just being the sum of all the necessary space, as R may reuse some of that space and so need less. And R may use a little more for an object. But it gives us a good estimate of the approximate requirements.

Also, we can compute this for different sizes of inputs and extrapolate.

Suppose we have a computation that works on n objects, e.g. 75 nodes in an ad hoc network. We write a function to determine if the network is fully connected. How many nodes can we realistically run this for?

Try it for n = 5, 10, 15, 20, ..., 100 and compute the total memory and time used during that computation. Then extrapolate to get approximate memory and time as a function of n.

Find out how much space is being used with gc(). It forces garbage collection and also reports status.

    > gc(reset = TRUE)
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 171168  4.6     350000  9.4   350000  9.4
    Vcells  62857  0.5     786432  6.0   337583  2.6

The Vcells are the interesting ones for us (our computations and data). Note that gc() returns a matrix with 2 rows and 6 columns. We can extract the max used columns and compare across calls.

Call gc(reset = TRUE), run the computation, and then call gc(reset = FALSE).

    > orig = gc(reset = TRUE)
    > sapply(1:10, simulation)
    > end = gc()
    > end["Vcells", 6] - orig["Vcells", 6]

Summary of the tools so far:

o gcinfo()
o object.size()
o gc(reset = TRUE) followed by gc()

Start R with a large workspace

    R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max-ppsize=N

See ?mem.limits and ?Memory.

Start R with default settings:

    R
    > gc()
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 169451  4.6     350000  9.4   350000  9.4
    Vcells  62425  0.5     786432  6.0   337539  2.6

Now let's create a large matrix, 1000 x 1000. Before we do, ask R to tell us when it does garbage collection/resizing of the available space.

    gcinfo(TRUE)
    m = matrix(rnorm(1000*1000), 1000, 1000)

    Garbage collection 4 = 1+0+3 (level 2) ...
    180323 cons cells free (51%)
    9.6 Mbytes of heap free (95%)
    Garbage collection 5 = 1+0+4 (level 2) ...
    180330 cons cells free (51%)
    9.6 Mbytes of heap free (54%)
    Garbage collection 6 = 1+0+5 (level 2) ...
    180333 cons cells free (51%)
    9.6 Mbytes of heap free (37%)

    object.size(m)
    [1] 8000120

Now let's try that again, but this time start R with 1Gb of memory. Don't do this unless you know you need it!

Start R and tell it to use 2Gb of space for data objects:

    R --min-vsize=2G

Again, turn on reporting of garbage collection:

    gcinfo(TRUE)

Now allocate the same matrix:

    m = matrix(rnorm(1000*1000), 1000, 1000)

Note there was no garbage collection.

Fragmentation

Fragmentation happens when we create numerous objects and then remove some, leaving holes in the allocated memory.

    x1 = rnorm(10000)
    x2 = rnorm(10000)
    y = 10*x1 + 20*x2
    rm(x2)

When we remove x2, we are left with a big hole. If we go to allocate space for, say, 10001 elements, we cannot use this space.

We may have lots of little pieces of space which cumulatively total more than the desired amount of new space. But since they are not contiguous, we cannot use them, and so we cannot satisfy the new request.

We don't have much control over this in R, but it is good to know about it.

Quick Aside

Statistical Computing is an important and very under-represented area of research in the statistics field. There are many opportunities to do research in this area. We are developing a program in Davis and there are several others emerging.
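Looking back at the shell part of these notes, the difference between the * wild card and find can be tried out on a throwaway copy of the Test tree. The following is only a sketch under a few assumptions: the scratch directory from mktemp and the variable names scratch, tex_count and other_count are illustrative and not part of the notes, and the portable ! operator stands in for the -not option used on the slides.

```shell
#!/bin/sh
# Sketch: rebuild the Test tree from the example in a scratch directory.
# The scratch location and variable names are illustrative assumptions.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

mkdir -p Test/A/B/D
touch Test/file1.tex Test/x
touch Test/A/file2.tex
touch Test/A/B/file4.doc Test/A/B/file5.txt
touch Test/A/B/D/x Test/A/B/D/xz

# The wild card is expanded by the shell and only looks in the named directory.
ls Test/*.tex

# find walks the whole hierarchy; the pattern is quoted so the shell
# passes it to find unexpanded.
tex_count=$(find Test -type f -name '*.tex' | wc -l | tr -d ' ')
other_count=$(find Test -type f ! -name '*.tex' | wc -l | tr -d ' ')
echo "tex files: $tex_count, other files: $other_count"
```

The ls line reports only Test/file1.tex, while find also reaches Test/A/file2.tex. Piping into wc -l counts the lines find prints, one per file.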
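The batch-job recipe from the R lecture (redirect output and errors to one file, run in the background, note the process identifier, check the file later) can be rehearsed without R. This is a minimal sketch for a Bourne-style shell: fake_job is a hypothetical stand-in for the R --no-save < myCode.R command, and the names workdir, out, job_pid and line_count are assumptions made for the illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a long-running R job: it writes progress
# markers to standard output and one message to standard error.
fake_job() {
    echo "iteration 1 done"
    echo "a warning message" >&2
    echo "iteration 2 done"
}

workdir=$(mktemp -d)
out="$workdir/output.Rout"

# Send standard output to the file first, then point standard error
# at the same place. The trailing & puts the job in the background.
fake_job > "$out" 2>&1 &
job_pid=$!                 # process identifier of the background job
echo "started job $job_pid"

# In real use you would log out and check back later; here we just wait.
wait "$job_pid"
line_count=$(wc -l < "$out" | tr -d ' ')
echo "lines captured: $line_count"
tail -n 1 "$out"           # the last progress marker
```

For a real job that must survive logging out, the command would also be prefixed with nohup as described in the notes; the sketch leaves that out so that it finishes immediately.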