Running Example and Project Organization

Everything starts somewhere, though many physicists disagree.

— Terry Pratchett

As with many research projects, the first step in our Zipf’s Law analysis is to download the research data and install the required software. Before doing that, it’s worth taking a moment to think about how we are going to organize everything. We will soon have a number of books from Project Gutenberg in the form of a series of text files, plots we’ve produced showing the word frequency distribution in each book, as well as the code we’ve written to produce those plots and to document and release our software package. If we aren’t organized from the start, things could get messy later on.

Zipf’s Law#

Imagine a giant bowl filled with words from all your favorite books. Zipf’s law, named after linguist George Kingsley Zipf, predicts a curious pattern within this jumble. As you scoop out the most used words, one by one, you’ll find something fascinating: a rank-frequency relationship.

[Image: George Kingsley Zipf, 1917]

The most frequent word will appear roughly twice as often as the second most frequent word, three times as often as the third, and so on. This means a small number of words (“the,” “of,” “and”) dominate the word soup, while countless others appear rarely.

Mathematically, Zipf’s law looks like this: frequency of a word \(∝\) 1/rank. (Remember, “\(∝\)” means “proportional to.”) Ranking the words by frequency and plotting frequency against rank therefore shows the sharp drop-off in usage: a steeply falling curve on linear axes, and an approximately straight line with a slope of about −1 when both axes are on a log scale.
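To make the rank-frequency relationship concrete, here is a minimal sketch that counts words in a short string, ranks them, and prints each observed count next to the count an idealized 1/rank law would predict. The function names `rank_frequency` and `zipf_prediction` and the toy sentence are assumptions made for illustration; they are not the programs developed later in the book.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Count that an idealized Zipf distribution predicts at a given rank."""
    return top_count / rank


if __name__ == "__main__":
    # Toy text for illustration only; real books are far larger.
    sample = "the cat and the dog and the bird saw the fish"
    table = rank_frequency(sample)
    top_count = table[0][2]
    for rank, word, count in table[:3]:
        predicted = zipf_prediction(top_count, rank)
        print(rank, word, count, round(predicted, 2))
```

A text this short cannot be expected to follow the law closely; the pattern only becomes visible with book-length input.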

But Zipf’s law isn’t limited to words. It pops up in surprising places! Here are some examples:

City sizes: The population of the second-largest city is roughly half that of the largest, the third is a third, and so on.
Website hits: The most popular page on a website gets far more visits than the second, and so on.
Income distribution: A small number of people hold a large chunk of wealth, while most have less. (This connects to the “Pareto principle”, also known as the 80/20 rule.)

This can be seen in the following graph:

While not perfect, Zipf’s law offers a powerful tool for understanding various systems. It tells us that few things are extremely common, while many things are rare. This pattern has implications for language evolution, information retrieval, economic analysis, and even urban planning.

However, it’s important to remember that Zipf’s law is an empirical observation, not an ironclad rule. Deviations occur, and other factors can influence frequency distributions. Still, its ubiquity and simplicity make it a valuable lens for exploring the hidden order in seemingly random data.

So, the next time you pick up a book, think of Zipf’s law at work. The words you see most often are just the tip of the iceberg, reflecting a deeper pattern about how information is distributed in our world.

Project Structure#

Project organization is like a diet: everyone has one, it’s just a question of whether it’s healthy or not. In the case of a project, “healthy” means that people can find what they need and do what they want without becoming frustrated. This depends on how well organized the project is and how familiar people are with that style of organization.

As with good coding style, small pieces in predictable places with readable names are easier to find and use than large chunks that vary from project to project and have names like “stuff”. We can be messy while we are working and then tidy up later, but experience teaches that we will be more productive if we make tidiness a habit.

In building the Zipf’s Law project, we’ll follow a widely used template for organizing small and medium-sized data analysis projects [Noble, 2009]. The project will live in a directory called zipf, which will also be a Git repository stored on GitHub (see Chapter Git Command-line).

The following is an abbreviated version of the project directory tree as it appears toward the end of the book:

zipf/
├── .gitignore
├── CITATION.md
├── CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── Makefile
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   └── ...
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
├── docs
│   └── ...
├── results
│   ├── collated.csv
│   ├── dracula.csv
│   ├── dracula.png
│   └── ...
└── ...

The full, final directory tree is documented in the Appendix: Tree.

Standard information#

Our project will contain a few standard files that should be present in every research software project, open source or otherwise:

README includes basic information on our project. We’ll create it in Chapter Git Advanced (see the Section on including a README) and extend it in Chapter Packaging.
LICENSE is the project’s license. We’ll add it in the Section on including a license.
CONTRIBUTING explains how to contribute to the project. We’ll add it here.
CONDUCT is the project’s Code of Conduct. We’ll add it here.
CITATION explains how to cite the software. We’ll add it here.

Some projects also include a CONTRIBUTORS or AUTHORS file that lists everyone who has contributed to the project, while others include that information in the README (we do this in Chapter Git Advanced) or make it a section in CITATION. These files are often called boilerplate, meaning they are copied without change from one use to the next.

Organizing project content#

Following [Noble, 2009], the directories in the repository’s root are organized according to purpose:

Runnable programs go in bin/ (an old Unix abbreviation for “binary”, meaning “not text”). This will include both shell scripts, e.g., book_summary.sh, developed in Chapter bash Advanced, and Python programs, e.g., countwords.py, developed in Chapter Building Command-Line Tools with Python.
Raw data goes in data/ and is never modified after being stored. You’ll set up this directory and its contents in the Section Downloading the Data.
Results are put in results/. This includes cleaned-up data, figures, and everything else created using what’s in bin and data. In this project, we’ll describe exactly how bin and data are used with a Makefile created in Chapter Introduction to Make and Snakemake.
Finally, documentation and manuscripts go in docs/. In this project, docs will contain automatically generated documentation for the Python package, created in the Section on Documentation using Sphinx.

This structure works well for many computational research projects and we encourage its use beyond just this book. We will add some more folders and files not directly addressed by [Noble, 2009] when we talk about testing (Chapter on Testing), provenance (Chapter Tracking Provenance), and packaging (Chapter Packaging).
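If you would rather create the top-level directories with a script than by hand, the layout described above can be sketched with Python’s pathlib module. The helper name `create_skeleton` is hypothetical, and the book itself builds the directories up step by step, so treat this as one possible shortcut rather than the method used in later chapters.

```python
from pathlib import Path

# Top-level directories taken from the abbreviated project tree.
DIRECTORIES = ["bin", "data", "docs", "results"]


def create_skeleton(root):
    """Create the project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in DIRECTORIES:
        path = root / name
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for path in create_skeleton("zipf"):
        print(path)
```

Because existing directories are left untouched, re-running the script never disturbs files already stored in data/ or results/.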

Downloading the Data#

The data files used in the book are archived at an online repository called Figshare (which we discuss in detail in the Section on where to archive data) and can be accessed at:

https://doi.org/10.6084/m9.figshare.13040516

We can download a zip file containing the data files by clicking “download all” at this URL and then unzipping the contents into a new zipf/data directory (also called a folder) that follows the project structure described above. Here’s how things look once we’re done:

zipf/
└── data
    ├── README.md
    ├── dracula.txt
    ├── frankenstein.txt
    ├── jane_eyre.txt
    ├── moby_dick.txt
    ├── sense_and_sensibility.txt
    ├── sherlock_holmes.txt
    └── time_machine.txt
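The unzipping step can also be done from Python, which makes it easy to check that every expected file arrived. The sketch below uses the standard zipfile module. The archive name `13040516.zip` is an assumption (use whatever name your browser gave the download), and `extract_data` and `missing_files` are hypothetical helper names.

```python
import zipfile
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "README.md",
    "dracula.txt",
    "frankenstein.txt",
    "jane_eyre.txt",
    "moby_dick.txt",
    "sense_and_sensibility.txt",
    "sherlock_holmes.txt",
    "time_machine.txt",
]


def extract_data(archive, data_dir):
    """Unzip archive into data_dir and return the sorted names found there."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    return sorted(p.name for p in data_dir.iterdir())


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    archive = Path("13040516.zip")  # assumed name of the downloaded file
    if archive.exists():
        extract_data(archive, "zipf/data")
    print("Missing:", missing_files("zipf/data"))
```

An empty list after “Missing:” means the data directory matches the listing.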

Summary#

Now that our project structure is set up and our data is downloaded, we are ready to start our analysis.

Getting ready#

Make sure you’ve downloaded the required data files (following the Section Downloading the Data) and installed the required software (as described here) before progressing to the next chapter.

Key Points#

Make tidiness a habit, rather than cleaning up your project files later.
Include a few standard files in all your projects, such as README, LICENSE, CONTRIBUTING, CONDUCT, and CITATION.
Put runnable code in a bin/ directory.
Put raw/original data in a data/ directory and never modify it.
Put results in a results/ directory. This includes cleaned-up data and figures (i.e., everything created using what’s in bin and data).
Put documentation and manuscripts in a docs/ directory.
Refer to The Carpentries software installation guide if you’re having trouble, or send us an email.