Re: advice for the undergrad (OL)



On Sat, Jul 05, 2008 at 09:38:59PM -0700, T. Michael Keesey scripsit:
> On Sat, Jul 5, 2008 at 8:17 PM, Graydon <oak@uniserve.com> wrote:
> > If you're going to be swoggling lots of character data (data
> > represented by text strings, I mean) -- genetic or molecular
> > analysis -- something like Perl is probably your best choice.
> > (There's a lot of it; being made of ancient effective Unix evil may
> > or may not warp your mind; other people are already using it; for
> > this very specific problem realm, nothing else is as good.)
> 
> Well, I think the primary utility of Perl in this regard is regular
> expressions (a.k.a. regex, regexp, etc.), and many programming
> languages have adopted these from Perl (or at least provide toolkits
> for them). Regex strings are part of native Java and PHP, and are
> actual elements of ActionScript. A great number of text editors allow
> search and replace by regular expression patterns, too.

Not all regular expression parsers are created equal, and not all string
handling syntaxes are created equal, either.

The Perl regular expression parser is very good; the Perl form of
regular expression operators is, for practical purposes, a superset of
the POSIX-defined set of regular expression operators; and having syntax
like

@matches = grep { m/GATACA/ } @genes;

integral to the language is extremely handy.  (That's 'get all the lines
in a list that contain the sequence I'm looking for and stick them in
this other list called matches', for those unfamiliar with Perl.)
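
For anyone who wants to try that line, here is a minimal self-contained
sketch. The contents of @genes are made-up sample strings, not real
sequence data, and the case-insensitive variant is an addition for
illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample data: each element is one sequence string.
my @genes = (
    'TTGATACAGG',
    'CCCGGGTTTA',
    'GATACA',
    'gatacattt',
);

# Keep only the elements that contain the literal substring GATACA.
my @matches = grep { m/GATACA/ } @genes;

# Same filter, ignoring case, via the /i modifier.
my @matches_any_case = grep { m/GATACA/i } @genes;

print scalar(@matches), " case-sensitive matches\n";            # 2
print scalar(@matches_any_case), " case-insensitive matches\n"; # 3
```

grep evaluates the block once per element with $_ set to that element,
and m/GATACA/ tests $_ by default, which is why no variable name appears
inside the braces.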

(Discussion of the inadvisability of arbitrary back references,
non-finite state machines, and regular expression performance forcibly
suppressed.)

> Regular expressions are very useful and really *are* something that
> can be learned in a relatively short amount of time. (They're not a
> programming language in and of themselves, though, more like a textual
> query language.) The syntax is kind of odd, but it's succinct and
> powerful. Good place to start:
> http://www.regular-expressions.info/tutorial.html

No comment on the 'place to start', not having tried it, but yes,
knowing regular expressions is a very good thing.

Having taught them, I'm not sure I'd go along with 'relatively short
amount of time'; this is one of those 'simple to learn, very hard to
master' sorts of things.

> >> Also very important in science are other types of computer languages,
> >> like query languages (SQL and variants thereof, XQuery, etc.), markup
> >> languages (XML, HTML, etc.), data languages (XML again, HTML
> >> microformats, YAML, weird little specialized languages like DOT and
> >> NEXUS, etc.),  etc. It's not all programming.
> >
> > Pretty much all of that is data representation of one kind or another.
> 
> Didn't mean to imply otherwise.

Data representation <> programming language, at least to my possibly
pedantic way of thinking.  So I kvetched. :)

> > That's a good thing to understand, but if the idea is to be able to
> > prove you can program, as distinct from a career in programming, I'd
> > suggest hauling in specialist help for that category of problem.
> 
> I dunno, knowing the basics of XML, for example, can help an awful lot
> for many tasks.

XML has no basics.

(and I so wish I'd realized that three years ago when I started
explaining an XML content management system to the folks who use it....)

Either you're using it as a markup language, in which case you don't
need to understand what you're doing beyond (possibly) the aspect of
semantic labelling, or you're using it in a 'create a vocabulary' sense,
and that's not basic at all; you get to start at the Unicode collation
algorithm and go from there.

This is potentially cool stuff, and XSL is fundamentally a tree
manipulation language, and it's not at all a bad thing to know, but if
you're going for a data representation XML vocabulary for some
specialized purpose, you will need to understand how the whole stack
works from Unicode up through DTDs or schemas through all the various
XML rules.

So while an XML vocabulary for cladograms (and some sort of renderer for
it) would be of great potential use, allowing simple exchange of large,
complex trees, actually building it in a robust way would require you to
understand what you're doing.

> A good introduction: http://w3schools.com/xml/xml_whatis.asp
> (Although that page lists a basic understanding of HTML and JavaScript
> as prerequisites, that's not really true, at least for the "XML Basic"
> section.)

That's a web focus, and a good bit of what is being given in that intro
is at least arguably wrong, in the so-simple-it's-wrong way.

http://www.w3.org/XML/ gives (I think) a better overview, but then
again I actively use XML in some fairly complex ways so I'm doubtless
biased.

I think, rather than going for specifics, someone interested in what
language to learn for a paleo career might do well to:
        - find out what is being used where you want to go to grad
          school, and learn that
        - take a guess at what you want to do and find out what's being used
          for that
        - stick to open, public file formats for absolutely everything; it
          can be an initial hassle but it pays off over time

-- Graydon