Saturday, August 15, 2009
7:12 AM

Regex and you; and sed.

Regular expressions, henceforth known as regex or regexes, are very powerful but can be quite confusing, even for some who use them regularly.
First and foremost:
It's best to get in the habit of interpreting regular expressions in a rather literal way. For example, don't think:
^cat matches a line with cat at the beginning, but rather:
^cat matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t.
They both end up meaning the same thing, but reading it the more literal way allows you to intrinsically understand a new expression when you see it.

Quote from Mastering Regular Expressions 2nd Edition.
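
To see the literal reading in action, here is a small sketch using Python's re module. The sample strings are made up for illustration; the point is that ^cat only asks for "start of line, then c, then a, then t", so it happily matches "catalog" too.

```python
import re

# "^cat" read literally: start of line, then c, then a, then t.
pattern = re.compile(r"^cat")

# Hypothetical sample strings, not from the book.
samples = ["cat", "catalog", "a cat", "Cat nap"]

# True where the pattern matches, False where it does not.
results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

"a cat" fails because the c is not immediately after the beginning of the line, and "Cat nap" fails because C is not c.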

For the following, I'm using something that I needed to use today.
I'm on a slow dial-up connection. Keeping Debian up to date is easy with the Stable release, but I like Testing, so I make apt-get print the URIs that are going to be downloaded.
Except apt-get prints a lot of information I don't want: hashes and the like. The URLs of the files are wrapped in apostrophes:
'http://security.debian.org/pool/updates/main/i/imagemagick/libmagick10_6.3.7.9.dfsg2-1~lenny3_i386.deb' libmagick10_7%3a6.3.7.9.dfsg2-1~lenny3_i386.deb 4027048 SHA256:b52b9a47a7abe0466f3a6b81e2e7bf0e76123971c6ec4bbf86ca373f83002b90
'http://ftp.it.debian.org/debian/pool/main/e/eglibc/libc6_2.9-23_i386.deb' libc6_2.9-23_i386.deb 4367254 SHA256:4a69953fdbc3e29992ee2d55167f1dc37c4f8a8f36906252473cded37a9bca24

I don't need the SHA256 hash and I don't need the apostrophes, just the URL.
So with a very simple command that uses two pipes in bash ( | ), awk, and sed, I can output all the URIs without the hashes and with the apostrophes removed.
apt-get -qq --print-uris dist-upgrade | awk '{print $1}' | sed "s/'//g" > packages

This sends the lines quoted above to awk, which prints the first whitespace-separated field ($1), in this case the quoted URL, and drops the rest of the line. The URLs then get sent to sed, which is told to look for ' using a regex pattern ("s/') and globally replace it with nothing (//g"). The > packages at the end writes the result to a file named packages. So, with a single line I got rid of everything I couldn't use.
Those two lines above are now:
http://security.debian.org/pool/updates/main/i/imagemagick/libmagick10_6.3.7.9.dfsg2-1~lenny3_i386.deb
http://ftp.it.debian.org/debian/pool/main/e/eglibc/libc6_2.9-23_i386.deb

Notice that sed didn't match everything inside of the apostrophes? Go read the quote at the top of the article; it'll all sink in eventually. Sed found the character ' and removed it, then found another ' and removed it.
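
If you'd rather see the same two steps spelled out, here is a rough Python sketch of what awk and sed are doing to one of the sample lines. The helper names first_field and strip_apostrophes are my own for illustration, and this only approximates the shell tools; it is not how apt-get, awk, or sed work internally.

```python
import re

# One of the sample lines printed by apt-get --print-uris, copied from above.
line = (
    "'http://ftp.it.debian.org/debian/pool/main/e/eglibc/libc6_2.9-23_i386.deb' "
    "libc6_2.9-23_i386.deb 4367254 "
    "SHA256:4a69953fdbc3e29992ee2d55167f1dc37c4f8a8f36906252473cded37a9bca24"
)

def first_field(text):
    """Roughly awk '{print $1}': the first whitespace-separated field."""
    fields = text.split()
    return fields[0] if fields else ""

def strip_apostrophes(text):
    """Roughly sed "s/'//g": replace every ' with nothing."""
    return re.sub(r"'", "", text)

url = strip_apostrophes(first_field(line))
print(url)
```

The pattern is just the single character ', so each apostrophe is matched and removed on its own; nothing between them is touched.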

The next example is a little confusing, but I'm going to jump right to it.
A while back I needed a way to grab the filename, with extension, with no leading slash.
With a jumble of various characters, regex can find exactly what you want, or exactly what you don't want...
s = re.search(r"[-_a-zA-Z0-9]+\.([a-zA-Z0-9]{1,3})$", url)
[Note from RobotCow: This is from the Python programming language using the re module. Can you guess what re means?]

This is exactly what I wanted, sort of.
Here's a simple breakdown of the function:
Search the text "url" (It's a variable, Sherlock.) for a match that runs right up to the end of the string: one or more characters, each of them a letter, a number, a - or a _, followed immediately by exactly one literal ., followed by one to three letters or numbers.
What about spaces in the filename, or %20 spaces? Well, they don't get found, so for a filename of "some file.zip" you only get file.zip, and for "some%20file.zip" you only get 20file.zip.
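
Here is a quick sketch that runs the pattern against a few URLs so you can watch it behave. The example.com URLs and the helper name filename_of are made up for illustration.

```python
import re

# The pattern from the post, compiled once.
FILENAME = re.compile(r"[-_a-zA-Z0-9]+\.([a-zA-Z0-9]{1,3})$")

def filename_of(url):
    """Return the matched filename, or None if the pattern finds nothing."""
    match = FILENAME.search(url)
    return match.group(0) if match else None

# Hypothetical URLs, chosen to show where the pattern works and where it falls short.
plain = filename_of("http://example.com/files/archive.zip")
spaced = filename_of("http://example.com/files/some file.zip")
encoded = filename_of("http://example.com/files/some%20file.zip")
long_ext = filename_of("http://example.com/files/page.html")

print(plain, spaced, encoded, long_ext)
```

The last one comes back as None: "html" is four characters, and the pattern only allows one to three after the dot.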
See if you can find a way to include spaces and other characters in the filename.
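
If you get stuck, here is one possible answer, and certainly not the only one. It assumes the filename is whatever follows the last slash, decodes %20 and friends with urllib.parse.unquote first, and still wants a one to three character extension. The name loose_filename_of and the example.com URLs are made up for illustration, and a URL with a query string on the end would need more work.

```python
import re
from urllib.parse import unquote

# Anything that is not a slash, then a literal dot, then 1 to 3 letters or numbers,
# all the way to the end of the string.
LOOSE_FILENAME = re.compile(r"[^/]+\.[a-zA-Z0-9]{1,3}$")

def loose_filename_of(url):
    """Decode percent-escapes, then return the last path segment if it looks like a file."""
    match = LOOSE_FILENAME.search(unquote(url))
    return match.group(0) if match else None

# Hypothetical URLs for illustration.
with_space = loose_filename_of("http://example.com/files/some file.zip")
with_escape = loose_filename_of("http://example.com/files/some%20file.zip")

print(with_space)
print(with_escape)
```

Swapping the character list for "anything but a slash" is the whole trick: instead of listing what is allowed, it lists the one thing that is not.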

I suggest everyone read the O'Reilly book Mastering Regular Expressions 2nd Edition, by Jeffrey E. F. Friedl.
Also, this is far from a complete and definitive guide, so go download a cheatsheet as well.

In these final words: have fun learning regex. It can be a little confusing at the start, but once you understand the way the expressions work, it gets easier to use.
