Kepler Mini-Project

CS 410/510 Mini-Project 1 :Building a complex scientific workflow

Due Thursday, August 24


In the previous assignment you constructed a simple workflow using Kepler. In this assignment you will construct a more complex biological workflow using Kepler and several biology web services. We will examine an existing biological workflow written in Java or Perl, and rewrite it using Kepler.

Note that this assignment is considerably harder than the last one. The purpose of the assignment is to introduce you to some of the challenges of implementing scientific workflows and get you to think about the tradeoffs between hand-coding vs. using a tool like Kepler. Please start early and ask questions if you get stuck. I will be generous with partial credit if you carefully document what each part of the workflow is supposed to do, and will give points for the parts that work correctly. I recommend using lots of Display actors to display intermediate results and help with debugging.

During the testing phase, it may help to save the web service query results to a file and replace the web service actor with a File Reader. That way you won't need to wait a few minutes for the web service to return your query results each time you test your workflow.

A written description of the workflow, along with links to source code and the Web Services used, is at http://xml.nig.ac.jp/workflow/blast_clustal.html.

Download the source code in either Java or PERL (whichever language you are most familiar with. Personally I found the PERL easier to read). You are welcome to compile and run the code if you wish, but it is not required. Note that to run this code, you will need Apache axis (for Java) or SOAP:Lite (for PERL) on your machine. If you are unable to run the code on your machine, don't worry, you only need to be able to read the source code to complete this assignment.

(Note that a WSDL description of any web service is available at http://xml.nig.ac.jp/wsdl/.wsdl). The links on the workflow page will take you to a "Method" page that provides a query interface the method used by the given web service. Clicking on the "document" link on the page will give you documentation on other methods provided by the web service. Note that some methods have identical functionality but produce outputs in different formats, you may find these useful later.

1. (5 points) Using the query in the test.txt file (downloaded with your Java or Perl source code), use the Blast web service linked from the workflow webpage. Approximately how many distinct accession numbers (second column) are in the result? How long would it take to run GetEntry on each of these by hand?

2.(5 points) Create a blank Kepler workflow and insert a Web Service actor. Right click on the actor and select configure and type in the Blast WSDL URL (http://xml.nig.ac.jp/wsdl/Blast.wsdl) for the wsdlUrl value (leave the others blank for now). Click "commit" and wait a few seconds. Configure the actor again and look at the methodName field. Has the interface changed? How? Set the methodName to "searchParam" and click commit. What does the Web Service icon look like now?

3. (80 points) Construct the BLAST-ClustalW workflow using Kepler Web Service actors for each service specified at the web site (as well as any other actors you need). You can copy the parameter settings you need for each web service from the source code (Java or PERL). Note that you can also get hints on how to parse the results returned by a web service by looking at the PERL or Java code. Many of the string and array actors that you saw in the last assignment will be useful here (especially those that use regular expressions).

Basically your workflow should do the following: Given a sequence from the file test.txt:

1.run the Blast web service on this query
2. Extract the top K distinct accession numbers (second column) from the result and run the GetEntry web service on each.
3.Extract the accession number, organism, and sequence from each of the K results.
4.Construct a sequence for each in FASTA format.
5.Paste these K sequences together to form a multiple sequence alignment query and submit to the query to the ClustalW web service.

(Note: removing duplicates in Kepler seems to be tricky. The Array Sort actor has a remove duplicates option, you are welcome to use this actor. However this actor also changes the sort order of the result returned by the BLAST web service. For the purposes of this assignment that is fine, but note that if you were using this workflow in practice you would need to write an actor or function to remove duplicates while preserving the original order)

4.(10 points) Discuss your experiences building this workflow. What parts were hardest/easiest? Comparing your Kepler workflow to the PERL or Java source code, what are the advantages/disadvantages of each type of implementation (i.e., using a scientific workflow tool vs. coding by hand)? Your response does not need to be long, one or two paragraphs is fine.
Laura G Bright
Last modified: Thu Aug 10 14:30:44 PDT 2006