Snakemake Exercises#
1. Write two new rules#
Write a new rule for creating `last.dat` from `data/last.txt`. Call the rule `count_words_last`.

Update the `dats` rule with this target.

Write a new rule called `zipf_test` to write the summary table to `results.txt`. The rule needs to:

- depend upon each of the three `.dat` files;
- invoke the action `python zipf_test.py jane_eyre.dat frankenstein.dat last.dat > results.txt`;
- be the default target.

Update `clean` so that it removes `results.txt`.
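A possible sketch of the updated Snakefile (one of several valid answers; the `wordcount.py` invocation follows the pattern used earlier in the lesson):

```snakemake
# zipf_test comes first, so it is the default target.
rule zipf_test:
    input:
        'jane_eyre.dat',
        'frankenstein.dat',
        'last.dat'
    output: 'results.txt'
    shell: 'python zipf_test.py jane_eyre.dat frankenstein.dat last.dat > results.txt'

rule count_words_last:
    input: 'data/last.txt'
    output: 'last.dat'
    shell: 'python wordcount.py data/last.txt last.dat'

rule clean:
    shell: 'rm -f *.dat results.txt'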
2. Update Dependencies#
What will happen if you now execute:

```
touch *.dat
snakemake results.txt
```

1. nothing
2. all files are recreated
3. only the `.dat` files are recreated
4. only `results.txt` is recreated
3. Rewrite .dat rules to use wildcards#
Rewrite each `.dat` rule to use the `{input}` and `{output}` wildcards.
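For example, the rule for the last book might become (a sketch; your filenames may differ):

```snakemake
rule count_words_last:
    input: 'data/last.txt'
    output: 'last.dat'
    # {input} and {output} are substituted from the rule's own fields.
    shell: 'python wordcount.py {input} {output}'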
4. Updating One Input File#
What will happen if you now execute:

```
touch data/last.txt
snakemake results.txt
```

1. only `last.dat` is recreated
2. all `.dat` files are recreated
3. only `last.dat` and `results.txt` are recreated
4. all `.dat` files and `results.txt` are recreated
5. Update count_words_last to depend on wordcount.py#
Add `zipf_test.py` as a dependency of `results.txt`. We haven’t yet covered the techniques required to do this with named wildcards, so you will have to use indexing. Yes, this will be clunky, but we’ll fix that part later!

Remember that you can do a dry run with `snakemake -n -p`!
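One way to do this with indexing might look like the following sketch, where the script is listed first so that the remaining inputs can still be passed to it by position:

```snakemake
rule zipf_test:
    input:
        'zipf_test.py',
        'jane_eyre.dat',
        'frankenstein.dat',
        'last.dat'
    output: 'results.txt'
    # input[0] is the script; the rest are the .dat files.
    shell: 'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'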
6. Putting it all together#
Using the `expand()` and `glob_wildcards()` functions, modify the pipeline so that it automatically detects and analyzes all the files in the `data/` folder.

Hint

Use `expand()` and `glob_wildcards()` together to create the value of `DATS`.
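A possible sketch: `glob_wildcards()` extracts the book names from the files in `data/`, and `expand()` turns them into the list of expected `.dat` targets.

```snakemake
# Match data/*.txt and collect the {book} part of each filename.
BOOKS = glob_wildcards('data/{book}.txt').book

# Build the list of .dat files the pipeline should produce.
DATS = expand('{book}.dat', book=BOOKS)

rule dats:
    input: DATS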
7. Moving Output Files into a Subdirectory#
Currently our workflow is generating a lot of files in the main directory. This is not so bad with small numbers of files, but it can get messy as the file count grows. One approach to this is to generate outputs into their own directories, named after the file types. For example:
```
zipf
├── data
│   ├── dracula.txt
│   ├── frankenstein.txt
│   ├── moby_dick.txt
│   └── time_machine.txt
├── dats
│   ├── dracula.dat
│   ├── frankenstein.dat
│   ├── moby_dick.dat
│   └── time_machine.dat
├── snakemake
│   ├── wordcount.py
│   ├── zipf_test.py
│   ├── plotcount.py
│   ├── results.txt
│   ├── run_pipeline.sh
│   └── snakefile
...
```
There are many potential arrangements, so you are free to choose whatever makes sense for your project. Snakemake is not prescriptive; it will put files wherever you tell it. So here we will learn how to move the `dat` files into a `dats` directory.
Alter the rules in your Snakefile so that the `dat` files are created in their own `dats/` folder. Note that creating this folder beforehand is unnecessary: Snakemake automatically creates any folders for you, as needed.
Hint

Make sure your `Snakefile` is up to date with the end of the preceding lesson. Use the provided solution files if necessary. Look for all the locations that reference the `dat` files and update them to add the `dats/` directory.
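After the change, the pattern rule might look like this sketch (names follow the lesson; the surrounding `DATS` definition needs the same update):

```snakemake
DATS = expand('dats/{book}.dat', book=glob_wildcards('data/{book}.txt').book)

rule count_words:
    input: 'data/{book}.txt'
    # Snakemake creates the dats/ folder automatically when needed.
    output: 'dats/{book}.dat'
    shell: 'python wordcount.py {input} {output}'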
8. Creating PNGs#
Your task is to update your Snakefile so that it can create `.png` files from `dat` files using `plotcount.py`.

The new rule should be called `make_plot`. All `.png` files should be created in a directory called `plots`. If you are using a Windows system, you could create the plots in the top-level directory instead, in order to avoid the Windows subdirectory bug. You may need to change back to the `plots` directory after we introduce the `all` rule.

As well as a new rule, you may also need to update existing rules.

Remember that when testing a pattern rule, you can’t just ask Snakemake to execute the rule by name: you need to ask Snakemake to build a specific file. So instead of `snakemake count_words` you need something like `snakemake dats/last.dat`.
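A sketch of the new rule, assuming `plotcount.py` takes the `.dat` file and the output image as its two arguments (check the script’s actual usage):

```snakemake
rule make_plot:
    input: 'dats/{book}.dat'
    output: 'plots/{book}.png'
    shell: 'python plotcount.py {input} {output}'
```

Test it by asking for a specific file, e.g. `snakemake plots/last.png`.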
9. Generating Plots#
Default Rules#
The default rule is the rule that Snakemake runs if you don’t specify a rule on the command line (e.g. if you just run `snakemake`). The default rule is simply the first rule in a Snakefile. While the default rule can be anything you like, it is common practice to call the default rule `all` and have it run the entire workflow.
Add an all rule#

Add an `all` rule to your Snakefile. Note that `all` rules often don’t need to do any processing of their own: it is sufficient to make them depend on all the final outputs from other rules. In this case, the outputs are `results.txt` and all the PNG files.
Hint

It is easiest to use `glob_wildcards` and `expand` to build the list of all expected `.png` files.
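One possible sketch, placed at the top of the Snakefile so it becomes the default target:

```snakemake
BOOKS = glob_wildcards('data/{book}.txt').book

# First rule in the file, so it is the default target.
rule all:
    input:
        'results.txt',
        expand('plots/{book}.png', book=BOOKS)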
10. Creating an Archive#
Let’s add a processing rule that depends on all previous stages of the workflow. In this case, we will create an archive tar file.

Update your pipeline to:

- create an archive file called `zipf_analysis.tar.gz`;
- make the archive contain all `dat` files, plots, and the Zipf summary table (`results.txt`);
- update `all` to expect `zipf_analysis.tar.gz` as input;
- remove the archive when `snakemake clean` is called.

The syntax to create an archive is:

```
tar -czvf zipf_analysis.tar.gz file1 directory2 file3 etc
```
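A sketch of the archive rule (rule and wildcard names are one possible choice; `all` would then list `zipf_analysis.tar.gz` as input, and `clean` would remove it):

```snakemake
BOOKS = glob_wildcards('data/{book}.txt').book

rule make_archive:
    input:
        expand('dats/{book}.dat', book=BOOKS),
        expand('plots/{book}.png', book=BOOKS),
        'results.txt'
    output: 'zipf_analysis.tar.gz'
    # {input} expands to all the files listed above.
    shell: 'tar -czvf {output} {input}'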
After these exercises our final workflow should look something like the following:

Fig. 69 Final Workflow#
11. Adding more books#
We can now do a better job of testing Zipf’s rule by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free ebooks to download.

Exercise instructions:

- Go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
- Download the ‘Plain Text UTF-8’ version and save it to the `books` folder; choose a short name for the file.
- Optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase `End of Project Gutenberg's [title], by [author]`).
- Run `snakemake` and check that the correct commands are run.
- Check the `results.txt` file to see how this book compares to the others.
Before going further, please check out the section Snakemake Continued Part 2.
12. What happens if Snakemake does not have enough resources?#
Modify your Snakefile and the snakemake arguments to test what happens when you have fewer resources available than the number required by a rule. For example, you might set `gpu=2` in `make_plot`, and then run `snakemake --resources gpu=1`.
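A sketch of the rule change (`gpu` is just an arbitrary resource name here; Snakemake does not inspect real hardware, it only does the bookkeeping):

```snakemake
rule make_plot:
    input: 'dats/{book}.dat'
    output: 'plots/{book}.png'
    # This rule claims 2 units of the "gpu" resource per job.
    resources: gpu=2
    shell: 'python plotcount.py {input} {output}'
```

Then run `snakemake --resources gpu=1` and observe what Snakemake does with jobs it cannot schedule.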
What do you think will happen? What actually happens?
13. Replace all other duplicated strings with global variables#
You will need to update:

- the string for `dat` files (`dats/{file}.dat`)
- the string for plot files (`plots/{file}.png`)
- the archive file `zipf_analysis.tar.gz`
- the results file `results.txt`

Hint

A formatted string can be used to get the global variables into the `clean` shell command. If you have inconsistent wildcard names, make them the same.
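A sketch of the intended kind of change, with hypothetical variable names:

```snakemake
# Hypothetical globals replacing the duplicated strings.
DAT_PATTERN = 'dats/{file}.dat'
PLOT_PATTERN = 'plots/{file}.png'
ARCHIVE = 'zipf_analysis.tar.gz'
RESULTS = 'results.txt'

rule clean:
    # An f-string substitutes the globals before Snakemake sees the command.
    shell: f'rm -rf dats/ plots/ {ARCHIVE} {RESULTS}'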
14. Combining global variables and wildcards in formatted strings#
The safest way to mix global variables and wildcards in a formatted string is to remember the following:

- Global variables are surrounded by single curly braces (e.g. `{INPUT_DIR}`).
- Wildcards are surrounded by double curly braces (e.g. `{{book}}`).
- Use upper case for globals and lower case for wildcards.
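Plain Python shows why the doubled braces are needed (`INPUT_DIR` is a hypothetical global):

```python
INPUT_DIR = "books/"

# Single braces are substituted when the f-string is evaluated;
# doubled braces survive as literal braces for Snakemake's wildcard.
pattern = f"{INPUT_DIR}{{book}}.txt"
print(pattern)  # books/{book}.txt
```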
Make your workflow configurable
Move all other configurable values into `config.yaml` and adjust the Snakefile. Remember to test your workflow as you go.
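One possible sketch, with hypothetical key names for `config.yaml`:

```snakemake
# config.yaml (hypothetical keys):
#   input_dir: data/
#   results_file: results.txt

configfile: 'config.yaml'

INPUT_DIR = config['input_dir']
RESULTS = config['results_file']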
15. Submitting a workflow with nohup#
`nohup some_command &` runs a command in the background and lets it keep running if you log off. Try running the pipeline in cluster mode using `nohup` (run `snakemake clean` beforehand).
Where does the Snakemake log go?
Why might this technique be useful?
You can kill the running Snakemake process with `killall snakemake`. Notice that if you try to run Snakemake again, it says the directory is locked. You can unlock the directory with `snakemake --unlock`.
16. Running Snakemake itself as a batch job#
Can we also submit the `snakemake --cluster` pipeline as a batch job? Is this a good idea? What are some problems of this approach?