Jason Lozier's Big Science Survey

Tuesday, 17 November 2009

Cliques

The problem with Graphviz is that it's made by engineers. There are things you can do with it that you probably didn't know about, and things you would think are obvious that it doesn't do.

A simple example is links between nodes. Take Leo and me: if I'm directed towards Leo, the syntax is simple.

"Jason" -> "Leo"

You would similarly think the converse could be written

"Jason" <- "Leo"

But it can't, and neither can

"Jason" <-> "Leo"

Both of these will cause Graphviz to exit with an error. Instead, for the two-way link, you must use

"Jason" -> "Leo" [dir = both]

which to me makes little or no sense. The converse is no better: you either swap the two names or add [dir = back]. Flipping the direction of an edge should be easy.
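One way round this is to generate the edge statements rather than type them. The following is only a minimal sketch in plain Python: the helper name dot_edge is made up for illustration, and the four values it accepts are the ones Graphviz documents for the dir edge attribute.

```python
# Hypothetical helper (not part of Graphviz): builds one DOT edge statement.
ALLOWED_DIRS = ("forward", "back", "both", "none")


def dot_edge(tail, head, direction="forward"):
    """Return a DOT edge statement, adding a dir attribute when needed."""
    if direction not in ALLOWED_DIRS:
        raise ValueError("unknown dir value: " + repr(direction))
    statement = '"' + tail + '" -> "' + head + '"'
    # "forward" is the default for directed graphs, so it needs no attribute.
    if direction != "forward":
        statement += " [dir = " + direction + "]"
    return statement


print(dot_edge("Jason", "Leo"))          # "Jason" -> "Leo"
print(dot_edge("Jason", "Leo", "back"))  # "Jason" -> "Leo" [dir = back]
print(dot_edge("Jason", "Leo", "both"))  # "Jason" -> "Leo" [dir = both]
```

The sketch assumes node names contain no double quotes; names that do would need escaping first.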

My second problem is cliques. If 5 people all collaborate with each other, then there are 4+3+2+1 = 10 edges between them. There is no easy way to link lots of nodes together in a clique using one line of DOT.

Jason -> Leo -> Dan [dir = all]

would seem like a logical syntax for this, but it doesn't work. However, you can be cheeky and nest multiple connections.

a -> {b c d}

means a links to b, c and d.

a -> {b -> {c -> {d}}}

makes edges between all listed nodes, and Graphviz will happily render this, but the syntax checkers do not like it, and with good reason: it's hacky, and adding extra attributes/arguments to the edges/nodes would be a nightmare.

Luckily there is yapgvb, a module for Python that will generate DOT files (or render them to SVG, PNG, etc.), and there is a nice example of how to create a clique in the documentation. This should make it easy to generate cliques in Graphviz.
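The same idea can be sketched without any library at all. The snippet below does not use yapgvb and is not the example from its documentation; it is plain Python that writes out the DOT text for an undirected clique, with clique_dot as a made-up function name.

```python
from itertools import combinations


def clique_dot(names, graph_name="clique"):
    """Return DOT source for an undirected graph linking every pair of names."""
    lines = ["graph " + graph_name + " {"]
    # combinations(names, 2) yields each unordered pair exactly once,
    # so n names give n * (n - 1) / 2 edges.
    for first, second in combinations(names, 2):
        lines.append('    "' + first + '" -- "' + second + '";')
    lines.append("}")
    return "\n".join(lines)


print(clique_dot(["Jason", "Leo", "Dan"]))
```

With three names this prints three edges; a list of five names would give the ten edges counted above. Because each edge sits on its own line, attributes can be added per edge without touching the nesting trick.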

Next up: Database Normalisation, Less is More or Less Better than More?

Tuesday, 10 November 2009

Interviews

Just interviewed several members of the biology and chemistry departments and found several interesting trends with regard to thinking about survey questions.

Collaboration needs to be well defined. The survey should ask separately about different discrete interactions. (Do you email, exchange materials, have lab discussions, etc.?)
People are happy to give 10 minutes for an undergraduate's final year project survey :)
The survey should be done in one fell swoop. Breaking it up would be annoying for users.
All published results need to be anonymised. People will be reluctant to reply otherwise.
January 6th, Week 0 of Spring term is the best date to send the survey - no exams to mark, backlog of emails from xmas cleared.
An incentive may be helpful, especially with regard to getting post-docs, PhDs etc. Whether this is one coffee per person, or an Amazon voucher (maybe a raffle, 20p per person that responds?) is to be decided.
People don't always say that they've "collaborated" with people they've published with, which reinforces the need for a better descriptor.
It seems that around 5-10 grant proposals are made per year per PI, with ~30% rejection rate.
Grants may need to be broken down (research councils, charities, industry, studentships etc.)
Strong inter-departmental collaborations exist between biology, chemistry and computer science, but there are probably more.
It will be interesting to look at the flow of data between mac/win/*nix users, and if there's an OS barrier there.
Seeing how physical distance correlates with the graph distance between nodes will be interesting; maybe the further apart people are, the fewer edges there are between them (within the University).
Some people have data that they would publish, but don't have the time, or don't think it would be published in a prestigious enough journal to be worthwhile.
People tend to collaborate regularly with 5-10 people outside of the university. How we define or question this is another matter.
People can easily list people they've collaborated with over the last 2-3 years, but may not know who they've shared data or materials with, as the handling of such requests may be delegated to someone else.
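The physical-versus-graph-distance question above could be checked roughly as follows. This is a sketch under stated assumptions, not a settled method: graph distance is taken as the number of hops on the shortest path, physical distance as the straight line between office coordinates, and the two are compared with a Pearson correlation. The function names and the four-person example are invented for illustration and are not survey data.

```python
from collections import deque
from itertools import combinations
from math import hypot, sqrt


def hops(adjacency, start):
    """Breadth-first search: hop count from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def physical_vs_graph_correlation(adjacency, positions):
    """Correlate physical and hop distance over all connected pairs."""
    physical, graph = [], []
    for a, b in combinations(sorted(adjacency), 2):
        reachable = hops(adjacency, a)
        if b in reachable:  # unconnected pairs have no graph distance
            (ax, ay), (bx, by) = positions[a], positions[b]
            physical.append(hypot(ax - bx, ay - by))
            graph.append(reachable[b])
    return pearson(physical, graph)


# Made-up example: four people in a chain, offices 10 m apart along a corridor.
example_adjacency = {"P1": ["P2"], "P2": ["P1", "P3"],
                     "P3": ["P2", "P4"], "P4": ["P3"]}
example_positions = {"P1": (0, 0), "P2": (10, 0), "P3": (20, 0), "P4": (30, 0)}
r = physical_vs_graph_correlation(example_adjacency, example_positions)
print(round(r, 3))
```

In this contrived chain the two distances are proportional, so the coefficient comes out at essentially 1; real data would be noisier, and a rank correlation might suit it better.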

A Conclusion?

There is a tradeoff between the amount of information and the number of respondents. We want enough information to be useful, but don't want to ask so many questions that people are put off. Achieving this balance will be tricky. To reduce bias in the survey, questions will have to be specific enough that we can confidently say people are interpreting them the same way, but not so specific as to burden the user with the time needed to answer.

First Meeting

Yesterday I had my first meeting with Leo Caves and Dan Franks to kickstart my project.

I went in with the idea that we were going to be looking at the non-published collaborative network, but Dan threw in the idea of mapping the published network, and adding weight to the graph edges.

This is a really good idea. However, coming from a biology background, I feel that to do it properly I would have to learn a bit more coding than I originally envisaged, and that the work would be more mathematical than biological (which is not my strength: I'm not innumerate, but I've only done A Level Physics, not maths!). On the other hand, I would be able to build upon the work done by previous third year project students.
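To give a feel for what "adding weight to the graph edges" might involve, here is a minimal sketch. It assumes the weight of an edge is simply the number of papers two people share, which is only one possible definition and not something fixed in the meeting. The author labels A, B and C are placeholders, and the function names are made up. weight and penwidth are standard Graphviz edge attributes.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(papers):
    """Count, for every pair of authors, how many papers they share."""
    weights = Counter()
    for authors in papers:
        # sorted(set(...)) removes duplicates and fixes the order of each pair.
        for first, second in combinations(sorted(set(authors)), 2):
            weights[(first, second)] += 1
    return weights


def weighted_dot(weights, graph_name="published"):
    """Return DOT source with one weighted, undirected edge per author pair."""
    lines = ["graph " + graph_name + " {"]
    for (first, second), count in sorted(weights.items()):
        attributes = "[weight = " + str(count) + ", penwidth = " + str(count) + "]"
        lines.append('    "' + first + '" -- "' + second + '" ' + attributes + ";")
    lines.append("}")
    return "\n".join(lines)


# Placeholder author lists, one list per paper (not real publications).
example_papers = [["A", "B"], ["B", "A", "C"]]
print(weighted_dot(coauthor_weights(example_papers)))
```

Here A and B share two papers, so their edge is drawn twice as thick as the others and pulls them closer together in the layout.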

The other option, harvesting unpublished network data manually (email, speaking to people etc.), seemed like it would need, and develop, a more diverse skill set. I'm going to have to speak to social scientists, biologists, psychologists, web designers, network specialists, and loads of other people to pull this off.

The biggest problem is making sure that I have useful data.

If the questions are too vague, then there will be bias in how people answer them. The idea is to split the questions into bite-sized chunks where there's no room for interpretation, so that my dataset is consistent. Defining words like "collaboration" is hard when people might count handing over a plasmid/antibody as collaboration, or a chat over a cup of coffee, or actually working on or exchanging data.

Where does the buck stop? There needs to be discrete, quantifiable data for me to be able to apply graph/network metrics.
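As a rough illustration of what "discrete, quantifiable" could look like, the sketch below records each answer as a (respondent, other person, interaction type) triple and keeps one edge set per interaction type, so a simple metric such as degree can be computed for each. The category names, function names and example answers are all invented here, not the survey's actual design.

```python
from collections import defaultdict

# Hypothetical interaction categories; the real survey may use different ones.
INTERACTIONS = ("email", "materials", "data", "discussion")


def build_layers(responses):
    """Group (respondent, other, kind) answers into one edge set per kind."""
    layers = {kind: set() for kind in INTERACTIONS}
    for respondent, other, kind in responses:
        if kind not in layers:
            raise ValueError("unknown interaction type: " + repr(kind))
        if respondent != other:  # ignore self-reports
            # frozenset makes the edge undirected: (P1, P2) equals (P2, P1).
            layers[kind].add(frozenset((respondent, other)))
    return layers


def degree(edges):
    """Number of distinct partners each person has within one edge set."""
    counts = defaultdict(int)
    for edge in edges:
        for person in edge:
            counts[person] += 1
    return dict(counts)


# Made-up answers from anonymised respondents P1 to P3.
example_responses = [
    ("P1", "P2", "email"),
    ("P2", "P1", "email"),      # same undirected edge as the line above
    ("P1", "P3", "email"),
    ("P2", "P3", "materials"),
]
layers = build_layers(example_responses)
print(degree(layers["email"]))
```

Keeping the interaction types apart means the word "collaboration" never has to be defined in the question itself; the layers can be merged or compared afterwards.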

We've opened Pandora's box (maybe a bad metaphor, I'm not implying that I've let loose terror within YCCSA or the biology department - unless you count my presence here) and there's plenty of discussion and prototyping to be had before I even begin to ask biologists questions.

Next up: Thoughts on designing a survey for scientists.

Monday, 2 November 2009

First Post

Need to put something here, will update with a more detailed post later. Current todos involve:

Learn the DOT language, Graphviz ins and outs.
Learn basic PHP, possibly a bit of AJAX, and improve my Python (will check the library out).
Sort out technical stuff.
Read some papers (social networks, mapping social networks, graph theory etc).
Write some ideas down. Update this lab blog.