BiPy

Monday, October 20, 2008

Django and Ensembl - part III

Basic object manipulation. Once the models have validated, we can use the interactive shell (IPython accessed via "python manage.py shell") to play around with data access.

In [1]: from djensembl.djpyensembl.models import CoordSystem

In [2]: CoordSystem.objects.all()

Out[2]: [, , , , ]

5 objects are returned from that call, corresponding to the 5 coord_systems. Just like in the tutorial, that naming system is not very informative. So, I modified the CoordSystem model just like they did by adding the __unicode__ method as such:

def __unicode__(self):
return str(self.coord_system_id) + ' -> ' + self.name

Now I get this:

In [1]: from djensembl.djpyensembl.models import CoordSystem

In [2]: CoordSystem.objects.all()

Out[2]: [ chromosome>, supercontig>, contig>, clone>, chromosome>]

I doubt that's ultimately what I want but it works for now.

Some of the built-in generic behaviors don't work well with this db due to the particular use/misuse/nonuse of the 'id' column. In a Django-centric-developed app, you could issue statements like:

c = CoordSystem.objects.get(pk=1)

But, since the table doesn't have an id column which is the primary key, auto-incremented, this call fails, as does c = CoordSystem.objects.get(id=1), obviously.

What does work is c = CoordSystem.objects.get(coord_system_id=17). I presume this is one of the 'drawback' of using legacy db's where the table structure is not designed with Django's shortcuts in mind. No problemo.

Django and Ensembl - part II

One thing I'll need to resolve going forward is the problem of read-only. The intent of ensembl is to use it 'as is', i.e in read-only mode. When Django starts a project and syncs the apps, it writes the default admin and auth tables into my ensembl db. Do I need these here? Do I need them at all? Can I set them up elsewhere?

If I don't want them or don't want them there, how best to make the db read-only? At the level of the MySQL db itself? Trusting that others won't abuse write privileges (yeah right)? Or in the Django code. I'm leaning towards the first option...

Django and Ensembl

Django 1.0 is out and so far I like it a lot. I worked through the tutorials and other than my html 101 skills, all seems fairly straightforward.

I thought I would write an app to access my local install of one of the ensembl databases to see what Django can offer for legacy dbs. Note that Django has a short tutorial page on legacy dbs. I'm skipping over the part where I point the Django settings to my local db...

What we do is set up the basic project structure as in the tutorial, then added my app (djpyensembl). With no excuse for poor directory/project/app naming I set up:

~/django_ensembl
/djensembl
/__init__.py
/djpyensembl <-- this is my app
/manage.py
/settings.py
/urls.py

Now, to create the models from an existing db, we use the "inspectdb" option for manage.py. Running "python manage.py inspectdb > models.py" will output the model statements as figured out by Django looking at the ensembl tables. Moving the models.py file into the djpyensembl app directory puts the models in the right place.

What happens next caught me off guard. I figured I'd jsut start the Django dev server using "python manage.py runserver 8080". This is what I got:

Validating models...
Unhandled exception in thread started by
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/django/core/management/commands/runserver.py", line 48, in inner_run
self.validate(display_num_errors=True)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/django/core/management/base.py", line 126, in validate
raise CommandError("One or more models did not validate:\n%s" % error_text)
django.core.management.base.CommandError: One or more models did not validate:
djpyensembl.interpro: "id": You can't use "id" as a field name, because each model automatically gets an "id" field if none of the fields have primary_key=True. You need to either remove/rename your "id" field or add primary_key=True to a field.

Turns out the interpro table in ensembl is set up in a way that's not conducive to Django.

mysql> desc interpro;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| interpro_ac | varchar(40) | NO | PRI | | |
| id | varchar(40) | NO | PRI | | |
+-------------+-------------+------+-----+---------+-------+

mysql> select * from interpro limit 1;
+-------------+---------+
| interpro_ac | id |
+-------------+---------+
| IPR000001 | PF00051 |
+-------------+---------+

mysql> SHOW CREATE TABLE interpro;
+----------+---------------------------------------------+
| Table | Create Table |
+----------+---------------------------------------------+
interpro | CREATE TABLE `interpro` (
`interpro_ac` varchar(40) NOT NULL default '',
`id` varchar(40) NOT NULL default '',
UNIQUE KEY `interpro_ac` (`interpro_ac`,`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
+----------+---------------------------------------------+

There is no real primary key for the table and one of the fields is called 'id'. From what I can tell, this is the only table that uses a plain 'id' name as a column name and also from what I can tell, this seems a poor naming choice. Not knowing much about the nature of the entries in interpro and not planning on using them for this demo, I enter hack mode and immediately assign 'id' as the primary key.

mysql> ALTER TABLE interpro ADD PRIMARY KEY(id);
Query OK, 18220 rows affected (0.75 sec)
Records: 18220 Duplicates: 0 Warnings: 0

As a result of this, I need to update my Interpro model in ~/django_ensembl/djensembl/djpyensembl/models.py (the file created by inspectdb). Before the update, it looked like this:

class Interpro(models.Model):
interpro_ac = models.CharField(unique=True, max_length=120)
id = models.CharField(max_length=120)
class Meta:
db_table = u'interpro'

It now looks like:

class Interpro(models.Model):
interpro_ac = models.CharField(unique=True, max_length=120)
id = models.CharField(primary_key=True, max_length=120)
class Meta:
db_table = u'interpro'

Now when I run "python manage.py runserver 8080", the models will validate.

More on this later...

Monday, October 13, 2008

More on MySQL

In my last job, I witnessed a MySQL database excede the 4GB limit. It was a home-grown lims system keeping track of massive amounts of sequencing data. I don't imagine I'll see that again but if I had to keep an eye on things, it's always handy to remember the sweet little commands. The MySQL forum has a nice summary here. What I like is the use of the double '/ 1024' to get the size display in MB and the reminder that the 'database size' is calculated from multiple values. Here we see that the size is coming from both the table data and the indexes. I've never paid close attention to the size of the indexes, but it's good to remember that.

The actual amount of disk space a db takes up is queried by something like a 'du' in the appropriate directory. This is what threw us for a loop once with a problem in the InnoDB tables. I think it was a bonafide bug in MySQL at the time but the InnoDB tables were taking up a LOT of disk space... I'll have to remember what that was all about. It was a good lesson.

Tuesday, October 7, 2008

A bit of a case

Saw an interesting talk tonight on PyMeta which brought back memories of one of my old coworkers working on natural language processing. It also gave me some insight into how a compiler works. Neither of those points are related to what I wanted to jot down though. In showing us some code, the speaker (Tom Marsell) revealed a case-like statement. It struck me as sweet so I jotted it down...partially. When I got home, here is what I got to work:

def meth(operator, num1, num2):
return {
'+': lambda: num1 + num2,
'-': lambda: num1 - num2
}[operator]()

print meth('+', 2, 2)

I don't think that's what Tom had up there but it inspired me nonetheless. Thanks Tom.

Thursday, October 2, 2008

MySQL and MySQLdb

Yet another post on this stuff, for my own reference...

A common issue with MySQL on the Mac is an error like "ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock'". Help files suggest changing your /etc/my.cnf file which doesn't exist on a Mac. What I can do is use a ~/.my.cnf file instead. I can't remember now if I created this file or if it was created for me. I also can't remember if I put the '/var/lib/mysql/mysql.sock' path in for the mysql.sock file or not...

What MySQL did want was for the sock file path to be /tmp/mysql.sock, so I edited the section in my ~/.my.cnf file as such:

[client]
socket = /tmp/mysql.sock

That does the trick. I should note that ~/.my.cnf is the last of five locations MySQL will check for this configuration data. See here for more info.

Once I got that working again, I went to use mysqldb and got this error:

ImportError: No module named mysqldb

I think I had flushed out my mysqldb installs during a purge for a particularly nasty bug hunt. So, I go to reinstall it from a tarball. I use the classic "python setup.py install" and get this:

EnvironmentError: mysql_config not found

I've seen the errors about missing library files that you need to copy over or link to but this one was new... I'm thinking that the reason is that I was originally using my MySQL install directory for development work so it was on my path before. During the bug hunt, I had reset my path and eliminated the directory with the mysql_config file... Sure enough, that was it.

It turns out that I may have also deleted my mysql library files from /usr/local/mysql as that director no longer exists... May need to re-install MySQL...

...the next day after a bit of sleep...

Looks like the /usr/local/mysql wasn't the problem. I did a quick search on my puter for "mysqlclient_r" and lo and behold, there is an entry in the FAQ about this:

Build Errors
------------

ld: fatal: library -lmysqlclient_r: not found

mysqlclient_r is the thread-safe library. It's not available on
all platforms, or all installations, apparently. You'll need to
reconfigure site.cfg (in MySQLdb-1.2.1 and newer) to have
threadsafe = False.

As soon as I set threadsafe to False, it builds.

Tuesday, September 2, 2008

Applying Patches to Git Code

My experience with group code development in bioinformatics has been limited to this point. Thus the idea of working with patches is fairly foreign. This summer, I finally got to work with them on the Pygr project. Just wanted to jot down some notes for my own reference. A simple 'man patch' command on a *nix system reveals the goodies:

SYNOPSIS
patch [options] [originalfile [patchfile]]

but usually just

patch -pnum

Knowing that a patch is simply a diff file helps. Re-remembering that source code repositories are all about diffs then helps simplify the concept. Then finally, knowing that creating and applying patches has been build into *nix systems for ever helps ease any fears of compatibility. The syntax is a bit old and *nixy but fairly straightforward.

The only option that I want to expand on is the "-pnum" part:

-pnum or --strip=num
Strip the smallest prefix containing num leading slashes from each
file name found in the patch file. A sequence of one or more adja-
cent slashes is counted as a single slash. This controls how file
names found in the patch file are treated, in case you keep your
files in a different directory than the person who sent out the
patch. For example, supposing the file name in the patch file was

/u/howard/src/blurfl/blurfl.c

setting -p0 gives the entire file name unmodified, -p1 gives

u/howard/src/blurfl/blurfl.c

without the leading slash, -p4 gives

blurfl/blurfl.c

and not specifying -p at all just gives you blurfl.c. Whatever you
end up with is looked for either in the current directory, or the
directory specified by the -d option.

If I open up the patch file I'm working with, part way in I see the text:

diff --git a/pygr/cnestedlist.pyx b/pygr/cnestedlist.pyx
index fa46280..0324c00 100644

The first line is the diff command used to create the patch file. The second line uses the keyword "index". In this case, the diff was created from a relative path position, thus no leading slash "/". When I apply the patch, I'll also use relative positioning. The "-pnum" option will help me get the proper relative position with which to apply the patch. This helps ensure that I don't have to care where the patch creator has his code installed (i.e the arbitrary /home/joesmoe/pythoncode/whatever/) but our directory structure at some point will be the same since we're working on the same code base - thus it's all relative baby.

So, if I cd to the root directory for the patch, represented by "a" or "b" in the relative paths, I can then use "-p1" to trim the "a" or "b" part and properly refer to the files I want patched:

patch -p1 < /path/to/patch/file/0001-thingy.patch

And now the patch is applied as a diff, just like a source code management tool would do. Because I was applying these changes in my local git repo, I can now see the effect:

prompt# git status
# On branch master
# Changed but not updated:
# (use "git add ..." to update what will be committed)
#
# modified: pygr/cnestedlist.pyx
# modified: tests/seqdb_test.py
#
no changes added to commit (use "git add" and/or "git commit -a")