Week 4: Grep and Regular Expressions

Video Notes

  • The most basic use of diff is to show the differences between two files:
    • diff file1 file2
  • You can compare two directories as well. Useful flags:
    • -B ignores changes that only add or remove blank lines
    • -b ignores changes in the amount of whitespace
    • -y shows the two files side by side
    • --width=50 (or another value) narrows the side-by-side output if you have a smaller screen
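As a minimal sketch of the diff flags above, the following builds two throwaway files in a temporary directory and compares them with and without -B. The file names a.txt and b.txt are made up for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmp/a.txt"
printf 'alpha\n\nbeta\n' > "$tmp/b.txt"   # same text plus one blank line

# Plain diff reports the extra blank line and exits with status 1.
plain_status=0
diff "$tmp/a.txt" "$tmp/b.txt" || plain_status=$?

# -B ignores changes that only add or remove blank lines, so the status is 0.
blank_status=0
diff -B "$tmp/a.txt" "$tmp/b.txt" || blank_status=$?

# Side-by-side view, limited to 50 columns (diff still exits 1 here).
diff -y --width=50 "$tmp/a.txt" "$tmp/b.txt" || true

echo "plain: $plain_status, with -B: $blank_status"
rm -r "$tmp"
```

The exit status is handy in scripts: 0 means no differences were reported, 1 means some were.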
  • The find command searches a directory tree for files that match the tests you give it.
    • It can match on name pattern, access time, permissions, and more
    • Syntax for find:
      • find path test pattern
      • find . -name "*resume*"
        • searches the current directory and everything below it for files whose names contain "resume", with any string before or after
    • find . -perm u=rwx,g=rwx,o=rwx
      • matches files whose permissions are exactly read, write, and execute for everyone (mode 777)
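The two find searches above can be tried safely in a scratch directory. This is a sketch only: the directory layout and file names (docs, open.sh, and so on) are hypothetical.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/docs"
touch "$tmp/docs/my-resume-2024.txt" "$tmp/docs/notes.txt" "$tmp/open.sh"
chmod 644 "$tmp/docs/my-resume-2024.txt" "$tmp/docs/notes.txt"
chmod 777 "$tmp/open.sh"

# Quote the pattern so the shell hands it to find unexpanded.
by_name=$(cd "$tmp" && find . -name "*resume*")

# -perm with a plain mode matches files whose mode is exactly rwxrwxrwx.
by_perm=$(cd "$tmp" && find . -type f -perm u=rwx,g=rwx,o=rwx)

echo "$by_name"   # ./docs/my-resume-2024.txt
echo "$by_perm"   # ./open.sh
rm -r "$tmp"
```

The -type f test is added here only to keep directories out of the permission search.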
  • locate is faster than find because it searches a prebuilt database of file paths instead of walking the disk:
    • before using it, update the database with: sudo updatedb
    • locate "*resume*"
  • locate doesn't let you restrict the search to a directory, so filter its output with grep:
    • locate "*resume*" | grep sofiaavila
    • grep keeps only the paths that contain "sofiaavila"
  • regular expressions:
    • if your expression contains a space, either surround it with quotation marks or escape the space with a backslash, so the shell treats the space as part of the expression rather than as a separator between grep's arguments
    • grep "hello world" filename
    • grep hello\ world filename
    • "." matches any single character, so:
      • grep s.ack filename will match, among others:
        • smack
        • snack
        • spack
        • stack
    • * repeats the preceding item 0 or more times (not 1 or more)
    • an address regular expression (a number, some text, two capital letters, five digits):
      • grep "[0-9]* .* [A-Z][A-Z] [0-9]{5}" tcp.c
    • grep "s[nm]ack" will look for "snack" or "smack"
    • {} for repetition: [0-9]{3} matches exactly three digits
    • a repetition applies only to the single item before it, so use parentheses to repeat a whole string:
      • grep "(hello){5}"
    • grep "[^0-9]{5}" matches five consecutive characters that are not digits
    • grep "^[^0-9]*$"
      • matches all lines that don't contain digits
    • ^ anchors the match to the start of a line and $ anchors it to the end; neither one consumes a character
    • grep "^.*hello$"
      • use this if you only care that "hello" is at the end of the line
    • If you have (group1) (group2), then \1 and \2 refer to what group1 and group2 matched, respectively
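The patterns in this part of the notes can be checked against a small word list. The sketch below is illustrative only: the sample lines are made up, and it spells out grep -E (extended syntax), which accepts all of these patterns; they should behave the same under grep -P.

```shell
# Sample lines to search; every entry is made up for the demo.
words='snack
smack
stack
sack
hellohello
hello
abc123
no digits here
end with hello'

# "." must match exactly one character, so "sack" is left out.
dot=$(printf '%s\n' "$words" | grep -E 's.ack')

# A bracket set matches one character from the set: snack and smack only.
set_match=$(printf '%s\n' "$words" | grep -E 's[nm]ack')

# {2} applies to the whole parenthesized group: hellohello only.
grouped=$(printf '%s\n' "$words" | grep -E '(hello){2}')

# Anchored at both ends: count the lines that contain no digits at all.
no_digit_count=$(printf '%s\n' "$words" | grep -Ec '^[^0-9]*$')

# \1 repeats whatever group 1 matched: hellohello only.
backref=$(printf '%s\n' "$words" | grep -E '^(hello)\1$')

echo "$dot"
echo "lines without digits: $no_digit_count"
```

Only "abc123" contains a digit, so the count covers the other eight lines.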
  • Special characters:
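    • \r : carriage return
    • \n : new line
    • \t : tab
    • \b : word boundary (the position between a word character and a non-word character)
    • \d : any digit [0-9]
    • \w : any word character: a letter, a digit, or an underscore
    • \s : whitespace: spaces, newlines, tabs
    • \B : the opposite of \b (a position that is not a word boundary)
  • alternation: (abc|def) matches either abc or def; spaces inside the parentheses are matched literally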
  • optional: (abc)? - the preceding group is optional
  • * : match 0 or more times
  • + : match 1 or more times
  • More information: regular-expressions.info
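A short sketch of the last few operators in these notes (either/or groups, optional groups, * versus +), again with made-up sample lines and grep -E:

```shell
lines='abc
def
abc def
color
colour
ac
abbc'

# Either alternative may match: abc and def.
alt=$(printf '%s\n' "$lines" | grep -E '^(abc|def)$')

# Spaces inside the group are literal, so this needs "abc " or " def".
spaced=$(printf '%s\n' "$lines" | grep -E '(abc | def)')

# (u)? makes the group optional: color and colour.
optional=$(printf '%s\n' "$lines" | grep -E '^colo(u)?r$')

# b* allows zero b characters, b+ requires at least one.
star=$(printf '%s\n' "$lines" | grep -E '^ab*c$')   # abc, ac, abbc
plus=$(printf '%s\n' "$lines" | grep -E '^ab+c$')   # abc, abbc

echo "$spaced"
```

Only the line "abc def" survives the spaced version, which is the usual reason an alternation with extra spaces seems to match nothing.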

Lab

SETUP

0.  Watch the grep videos and learn to love regular expressions.

1.  zsh is much more intuitive than csh in how it handles special characters and quotes. If you're on a myth machine, you're running csh by default, so consider switching to zsh before you start.

2.  Make sure that you have an alias from grep to grep -P (capitalization matters).  This will use perl style regular expressions.  If you aren't using perl style regular expressions, a lot of the stuff in the videos won't work.  To do this in zsh or bash, add the following line to your ~/.zshrc or ~/.bashrc file:

alias grep="grep -P"

Note that grep on recent Macs does not support the -P flag: with the upgrade to OS X 10.8, Apple removed Perl-style regular expressions from grep. If you're running a version of OS X below 10.8, you should still have the -P flag. If you're unsure, type "man grep" and check whether -P is listed. Without -P, a command like this will not work:

grep "n{2}gdev{2}" *

and you will have to escape the braces like this:

grep "n\{2\}gdev\{2\}" *
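To see the difference between the two forms, here is a small sketch run against two made-up lines. It uses "command grep" to bypass any grep alias so the default (basic) syntax is really in effect, and grep -E as a stand-in for a syntax where unescaped braces count repetitions.

```shell
sample='nngdevv
n{2}gdev{2}'

# Basic syntax: unescaped braces are ordinary characters,
# so only the line that literally contains "n{2}gdev{2}" matches.
basic_plain=$(printf '%s\n' "$sample" | command grep 'n{2}gdev{2}')

# Basic syntax with escaped braces: \{2\} is a repetition count.
basic_escaped=$(printf '%s\n' "$sample" | command grep 'n\{2\}gdev\{2\}')

# Extended syntax: unescaped braces are a repetition count.
extended=$(printf '%s\n' "$sample" | command grep -E 'n{2}gdev{2}')

echo "basic, plain braces:   $basic_plain"
echo "basic, escaped braces: $basic_escaped"
echo "extended:              $extended"
```

The last two commands both pick out "nngdevv", while the first one finds only the line with literal braces.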

3.  Download grep-exercises.tar.gz and untar it. 

You can use

    tar -xvzf someTarFile

to extract (x) and display verbose (v) information from a gzipped (z) file (f).  If the file is not gzipped (i.e., the name ends in .tar rather than .tar.gz), you will need to use tar -xvf rather than tar -xvzf.
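If you want to try the flags before touching the real archive, this sketch creates a tiny gzipped tarball in a scratch directory and extracts it again. The names demo, out, and file.txt are placeholders, not part of the lab files.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/demo" "$tmp/out"
echo "hello" > "$tmp/demo/file.txt"

# c = create, z = gzip, f = archive file name; -C changes directory first.
tar -czf "$tmp/demo.tar.gz" -C "$tmp" demo

# x = extract, v = verbose, z = gunzip, f = archive file name.
tar -xvzf "$tmp/demo.tar.gz" -C "$tmp/out"

extracted=$(cat "$tmp/out/demo/file.txt")
echo "$extracted"
rm -r "$tmp"
```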

PHONE NUMBERS

4.  Once you've untarred grep-exercises.tar.gz, open it and cd into the phone-numbers directory. There are 5 numbers directories.  Each of them has phone numbers intermixed with text.  Your task will be to make a regular expression that matches all of the well-formed phone numbers in every file in the directory.  

Numbers 1 is the easiest and Numbers 5 is the hardest.  Each successive directory adds a new way that numbers can be formatted.  If a number is malformed, like the missing parens case for Numbers 4 or the missing dashes case for Numbers 5, you SHOULD NOT match it.

Each directory has a "golden" file that has all of the phone numbers.

out.gold is what your grep command should output.  That is, if you run diff between out.gold and your grep output, there should be no output.  You will need to make sure that your grep command includes the golden file and excludes out.gold.  Also make sure that you run your grep command from within the directory -- otherwise, grep will include the path to every file, and that will mess with your output.
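The workflow can be rehearsed on toy data first. In the sketch below the files log1 and log2, the date pattern, and expected.gold (standing in for out.gold) are all made up, and it assumes the expected output consists of whole matching lines prefixed with bare file names.

```shell
tmp=$(mktemp -d)
printf 'due 2024-01-15\nno date here\n' > "$tmp/log1"
printf 'sent 2023-12-01\n' > "$tmp/log2"
# Hypothetical expected output, playing the role of out.gold.
printf 'log1:due 2024-01-15\nlog2:sent 2023-12-01\n' > "$tmp/expected.gold"

# From inside the directory, each match is prefixed with the bare file name.
(cd "$tmp" && grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' log1 log2 > my.out)

# No output and exit status 0 mean the two files are identical.
diff_status=0
diff "$tmp/expected.gold" "$tmp/my.out" || diff_status=$?

# From outside, the prefix includes the directory path, so the diff would fail.
outside_first=$(grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$tmp/log1" "$tmp/log2" | head -n 1)

echo "diff status: $diff_status"
echo "$outside_first"
rm -r "$tmp"
```

Listing the input files explicitly, as here, also keeps the output file and the expected file out of the search.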

Note that the syntax that the shell uses to glob for files (select multiple files) is NOT the same as normal (perl style or grep style) regular expressions.  If you don't know how to do it, then try making an echo command that prints out the name of two files (echo is pretty much the simplest command, so we often use echo when we want to test out something about shell syntax).  Then, make an echo command that prints out the name of every file that starts with cs1u.  Then, make an echo command that prints out all of the files that start with cs1u and the golden file.  If you can't figure out how to do all of the files, you might want to check out the video on ZSH globbing.
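To illustrate how globs differ from regular expressions, here is a sketch using echo on a few hypothetical files (report-1.txt, summary.txt, and so on). None of these names come from the lab directories.

```shell
tmp=$(mktemp -d)
touch "$tmp/report-1.txt" "$tmp/report-2.txt" "$tmp/summary.txt" "$tmp/notes.md"

# In a glob, * stands for any string of characters.
reports=$(cd "$tmp" && echo report-*)

# Patterns and plain names can be mixed; the shell expands each word separately.
mixed=$(cd "$tmp" && echo report-* notes.md)

# As a glob, s*.txt means "s, then anything, then .txt".
glob_hits=$(cd "$tmp" && echo s*.txt)

# As a regular expression, s* means "zero or more s",
# so ^s*\.txt$ matches none of these file names.
regex_hits=$(ls "$tmp" | grep -Ec '^s*\.txt$' || true)

echo "$reports"
echo "$mixed"
echo "glob: $glob_hits, regex match count: $regex_hits"
rm -r "$tmp"
```

The same characters, s*.txt, select summary.txt as a glob and nothing as a regular expression.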

For the first three, avoid using "OR"s. It's definitely possible. If you want to use an OR for the later ones, remember that grep treats whitespace literally.

For the last two, it is a mistake to look for an "elegant" solution.

If you want to output your command to a file, you can check the video on IO redirection.  Basically, grep foo bar > baz will grep for the pattern "foo" in the file "bar" and output the results to the file "baz."
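A runnable version of that example, as a sketch with made-up file contents, follows. The >> operator, which appends rather than overwrites, is included for contrast.

```shell
tmp=$(mktemp -d)
printf 'foo one\nbar two\nfoo three\n' > "$tmp/bar"

# > sends grep's standard output to baz, replacing any old contents.
grep foo "$tmp/bar" > "$tmp/baz"

# >> appends to the file instead of overwriting it.
grep two "$tmp/bar" >> "$tmp/baz"

saved=$(cat "$tmp/baz")
echo "$saved"
rm -r "$tmp"
```

After both commands, baz holds the two "foo" lines followed by "bar two".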

numbers1 - Search for numbers with dashes (i.e., 123-456-7890)

numbers2 - Search for numbers with dashes and those with no dashes (i.e., 1234567890)

numbers3 - Search for numbers with dashes, no dashes, and those with parens (i.e., (123)456-7890)

numbers4 - Again, you want dashes, no dashes, parens.  But someone malicious introduced malformed output into our files: missing parens (i.e., 123456-7890 or 123)456-7890), so you SHOULD NOT match malformed numbers.

numbers5 - Dashes, no dashes, parens.  Now you have missing parens and missing dashes (i.e., 123-4567890) as part of the malformations. Again, you SHOULD NOT match malformed numbers.

EXTRA PROBLEMS (if you have time on your own, try these!): 

5. In the match-only-one folder there are three group-references files.  You need to make a regular expression that will match every line in group-references-1 and no lines in group-references-2 or group-references-3.  Then, make one that matches 2 but not 1 or 3.  Then, make one that matches 3 but not 1 or 2.  Do the same for lookarounds.