05-script.html

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Software Carpentry: The Unix Shell</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <meta charset="UTF-8" />
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      <div class="banner">
        <a href="http://software-carpentry.org" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
        </a>
      </div>
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
                    <a href="index.html"><h1 class="title">The Unix Shell</h1></a>
          <h2 class="subtitle">Shell Scripts</h2>
          <section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Write a shell script that runs a command or series of commands for a fixed set of files.</li>
<li>Run a shell script from the command line.</li>
<li>Write a shell script that operates on a set of files defined by the user on the command line.</li>
<li>Create pipelines that include shell scripts you, and others, have written.</li>
</ul>
</div>
</section>
<p>We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a <strong>shell script</strong>, but make no mistake: these are actually small programs.</p>
<p>Let’s start by going back to <code>molecules/</code> and putting the following line into a new file, <code>middle.sh</code>:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">cd</span> molecules
$ <span class="kw">nano</span> middle.sh</code></pre></div>
<p>The command <code>nano middle.sh</code> opens the file <code>middle.sh</code> within the text editor “nano” (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to directly edit the file. We’ll simply insert the following line:</p>
<pre><code>head -n 15 octane.pdb | tail -n 5</code></pre>
<p>This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file <code>octane.pdb</code>. Remember, we are <em>not</em> running it as a command just yet: we are putting the commands in a file.</p>
<p>Then we save the file (using Ctrl-O), and exit the text editor (using Ctrl-X). Check that the directory <code>molecules</code> now contains a file called <code>middle.sh</code>.</p>
<p>Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called <code>bash</code>, so we run the following command:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> middle.sh</code></pre></div>
<pre class="output"><code>ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00</code></pre>
<p>Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.</p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="text-vs.whatever"><span class="glyphicon glyphicon-pushpin"></span>Text vs. Whatever</h2>
</div>
<div class="panel-body">
<p>We usually call programs like Microsoft Word or LibreOffice Writer “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses <code>.docx</code> files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools like <code>head</code>: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.</p>
</div>
</aside>
<p>What if we want to select lines from an arbitrary file? We could edit <code>middle.sh</code> each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit <code>middle.sh</code> and replace <code>octane.pdb</code> with a special variable called <code>$1</code>:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">nano</span> middle.sh</code></pre></div>
<p>Now, within “nano”, replace the text <code>octane.pdb</code> with the special variable called <code>$1</code>:</p>
<pre class="output"><code>head -n 15 &quot;$1&quot; | tail -n 5</code></pre>
<p>Inside a shell script, <code>$1</code> means “the first filename (or other parameter) on the command line”. We can now run our script like this:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> middle.sh octane.pdb</code></pre></div>
<pre class="output"><code>ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00</code></pre>
<p>or on a different file like this:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> middle.sh pentane.pdb</code></pre></div>
<pre class="output"><code>ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00
ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00
ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00
ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00
ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="double-quotes-around-arguments"><span class="glyphicon glyphicon-pushpin"></span>Double-Quotes Around Arguments</h2>
</div>
<div class="panel-body">
<p>We put the <code>$1</code> inside of double-quotes in case the filename happens to contain any spaces. The shell uses whitespace to separate arguments, so we have to be careful when using arguments that might have whitespace in them. If we left out these quotes, and <code>$1</code> expanded to a filename like <code>methyl butane.pdb</code>, the command in the script would effectively be:</p>
<pre><code>head -n 15 methyl butane.pdb | tail -n 5</code></pre>
<p>This would call <code>head</code> on two separate files, <code>methyl</code> and <code>butane.pdb</code>, which is probably not what we intended.</p>
</div>
</aside>
<p>We still need to edit <code>middle.sh</code> each time we want to adjust the range of lines, though. Let’s fix that by using the special variables <code>$2</code> and <code>$3</code> for the number of lines to be passed to <code>head</code> and <code>tail</code> respectively:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">nano</span> middle.sh</code></pre></div>
<pre class="output"><code>head -n &quot;$2&quot; &quot;$1&quot; | tail -n &quot;$3&quot;</code></pre>
<p>We can now run:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> middle.sh pentane.pdb 15 5</code></pre></div>
<pre class="output"><code>ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00
ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00
ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00
ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00
ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00</code></pre>
<p>By changing the arguments to our command we can change our script’s behaviour:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> middle.sh pentane.pdb 20 5</code></pre></div>
<pre class="output"><code>ATOM     14  H           1      -1.259   1.420   0.112  1.00  0.00
ATOM     15  H           1      -2.608  -0.407   1.130  1.00  0.00
ATOM     16  H           1      -2.540  -1.303  -0.404  1.00  0.00
ATOM     17  H           1      -3.393   0.254  -0.321  1.00  0.00
TER      18              1</code></pre>
<p>This works, but it may take the next person who reads <code>middle.sh</code> a moment to figure out what it does. We can improve our script by adding some <strong>comments</strong> at the top:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">nano</span> middle.sh</code></pre></div>
<pre class="output"><code># Select lines from the middle of a file.
# Usage: middle.sh filename end_line num_lines
head -n &quot;$2&quot; &quot;$1&quot; | tail -n &quot;$3&quot;</code></pre>
<p>A comment starts with a <code>#</code> character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people understand and use scripts.</p>
<p>What if we want to process many files in a single pipeline? For example, if we want to sort our <code>.pdb</code> files by length, we would type:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">wc</span> -l *.pdb <span class="kw">|</span> <span class="kw">sort</span> -n</code></pre></div>
<p>because <code>wc -l</code> lists the number of lines in the files (recall that <code>wc</code> stands for ‘word count’, adding the <code>-l</code> flag means ‘count lines’ instead) and <code>sort -n</code> sorts things numerically. We could put this in a file, but then it would only ever sort a list of <code>.pdb</code> files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use <code>$1</code>, <code>$2</code>, and so on because we don’t know how many files there are. Instead, we use the special variable <code>$@</code>, which means, “All of the command-line parameters to the shell script.” We also should put <code>$@</code> inside double-quotes to handle the case of parameters containing spaces (<code>&quot;$@&quot;</code> is equivalent to <code>&quot;$1&quot;</code> <code>&quot;$2&quot;</code> …) Here’s an example:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">cat</span> sorted.sh</code></pre></div>
<pre class="output"><code>wc -l &quot;$@&quot; | sort -n</code></pre>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> sorted.sh *.pdb ../creatures/*.dat</code></pre></div>
<pre class="output"><code>9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/unicorn.dat</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="why-isnt-it-doing-anything"><span class="glyphicon glyphicon-pushpin"></span>Why Isn’t It Doing Anything?</h2>
</div>
<div class="panel-body">
<p>What happens if a script is supposed to process a bunch of files, but we don’t give it any filenames? For example, what if we type:</p>
<pre><code>$ bash sorted.sh</code></pre>
<p>but don’t say <code>*.dat</code> (or anything else)? In this case, <code>$@</code> expands to nothing at all, so the pipeline inside the script is effectively:</p>
<pre><code>wc -l | sort -n</code></pre>
<p>Since it doesn’t have any filenames, <code>wc</code> assumes it is supposed to process standard input, so it just sits there and waits for us to give it some data interactively. From the outside, though, all we see is it sitting there: the script doesn’t appear to do anything.</p>
</div>
</aside>
<p>We have two more things to do before we’re finished with our simple shell scripts. If you look at a script like:</p>
<pre><code>wc -l &quot;$@&quot; | sort -n</code></pre>
<p>you can probably puzzle out what it does. On the other hand, if you look at this script:</p>
<pre><code># List files sorted by number of lines.
wc -l &quot;$@&quot; | sort -n</code></pre>
<p>you don’t have to puzzle it out — the comment at the top tells you what it does. A line or two of documentation like this make it much easier for other people (including your future self) to re-use your work. The only caveat is that each time you modify the script, you should check that the comment is still accurate: an explanation that sends the reader in the wrong direction is worse than none at all.</p>
<p>Second, suppose we have just run a series of commands that did something useful — for example, that created a graph we’d like to use in a paper. We’d like to be able to re-create the graph later if we need to, so we want to save the commands in a file. Instead of typing them in again (and potentially getting them wrong) we can do this:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">history</span> <span class="kw">|</span> <span class="kw">tail</span> -n 5 <span class="kw">&gt;</span> redo-figure-3.sh</code></pre></div>
<p>The file <code>redo-figure-3.sh</code> now contains:</p>
<pre><code>297 bash goostats -r NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff stats-NENE01729B.txt /data/validated/01729.txt &gt; 01729-differences.txt
299 cut -d &#39;,&#39; -f 2-3 01729-differences.txt &gt; 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5 &gt; redo-figure-3.sh</code></pre>
<p>After a moment’s work in an editor to remove the serial numbers on the commands, and to remove the final line where we called the <code>history</code> command, we have a completely accurate record of how we created that figure.</p>
<p>In practice, most people develop shell scripts by running commands at the shell prompt a few times to make sure they’re doing the right thing, then saving them in a file for re-use. This style of work allows people to recycle what they discover about their data and their workflow with one call to <code>history</code> and a bit of editing to clean up the output and save it as a shell script.</p>
<h2 id="nelles-pipeline-creating-a-script">Nelle’s Pipeline: Creating a Script</h2>
<p>An off-hand comment from her supervisor has made Nelle realize that she should have provided a couple of extra parameters to <code>goostats</code> when she processed her files. This might have been a disaster if she had done all the analysis by hand, but thanks to <code>for</code> loops, it will only take a couple of hours to re-do.</p>
<p>But experience has taught her that if something needs to be done twice, it will probably need to be done a third or fourth time as well. She runs the editor and writes the following:</p>
<pre><code># Calculate reduced stats for data files at J = 100 c/bp.
for datafile in &quot;$@&quot;
do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
done</code></pre>
<p>(The parameters <code>-J 100</code> and <code>-r</code> are the ones her supervisor said she should have used.) She saves this in a file called <code>do-stats.sh</code> so that she can now re-do the first stage of her analysis by typing:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> do-stats.sh *[AB].txt</code></pre></div>
<p>She can also do this:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">bash</span> do-stats.sh *[AB].txt <span class="kw">|</span> <span class="kw">wc</span> -l</code></pre></div>
<p>so that the output is just the number of files processed rather than the names of the files that were processed.</p>
<p>One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:</p>
<pre><code># Calculate reduced stats for  A and Site B data files at J = 100 c/bp.
for datafile in *[AB].txt
do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
done</code></pre>
<p>The advantage is that this always selects the right files: she doesn’t have to remember to exclude the ‘Z’ files. The disadvantage is that it <em>always</em> selects just those files — she can’t run it on all files (including the ‘Z’ files), or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing, without editing the script. If she wanted to be more adventurous, she could modify her script to check for command-line parameters, and use <code>*[AB].txt</code> if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.</p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="variables-in-shell-scripts"><span class="glyphicon glyphicon-pencil"></span>Variables in shell scripts</h2>
</div>
<div class="panel-body">
<p>In the <code>molecules</code> directory, imagine you have a shell script called <code>script.sh</code> containing the following commands:</p>
<pre><code>head -n $2 $1
tail -n $3 $1</code></pre>
<p>While you are in the <code>molecules</code> directory, you type the following command:</p>
<pre><code>bash script.sh &#39;*.pdb&#39; 1 1</code></pre>
<p>Which of the following outputs would you expect to see?</p>
<ol style="list-style-type: decimal">
<li>All of the lines between the first and the last lines of each file ending in <code>.pdb</code> in the <code>molecules</code> directory</li>
<li>The first and the last line of each file ending in <code>.pdb</code> in the <code>molecules</code> directory</li>
<li>The first and the last line of each file in the <code>molecules</code> directory</li>
<li>An error because of the quotes around <code>*.pdb</code></li>
</ol>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="list-unique-species"><span class="glyphicon glyphicon-pencil"></span>List unique species</h2>
</div>
<div class="panel-body">
<p>Leah has several hundred data files, each of which is formatted like this:</p>
<pre><code>2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1</code></pre>
<p>Write a shell script called <code>species.sh</code> that takes any number of filenames as command-line parameters, and uses <code>cut</code>, <code>sort</code>, and <code>uniq</code> to print a list of the unique species appearing in each of those files separately.</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="find-the-longest-file-with-a-given-extension"><span class="glyphicon glyphicon-pencil"></span>Find the longest file with a given extension</h2>
</div>
<div class="panel-body">
<p>Write a shell script called <code>longest.sh</code> that takes the name of a directory and a filename extension as its parameters, and prints out the name of the file with the most lines in that directory with that extension. For example:</p>
<pre><code>$ bash longest.sh /tmp/data pdb</code></pre>
<p>would print the name of the <code>.pdb</code> file in <code>/tmp/data</code> that has the most lines.</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="why-record-commands-in-the-history-before-running-them"><span class="glyphicon glyphicon-pencil"></span>Why record commands in the history before running them?</h2>
</div>
<div class="panel-body">
<p>If you run the command:</p>
<pre><code>history | tail -n 5 &gt; recent.sh</code></pre>
<p>the last command in the file is the <code>history</code> command itself, i.e., the shell has added <code>history</code> to the command log before actually running it. In fact, the shell <em>always</em> adds commands to the log before running them. Why do you think it does this?</p>
</div>
</section>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="script-reading-comprehension"><span class="glyphicon glyphicon-pencil"></span>Script reading comprehension</h2>
</div>
<div class="panel-body">
<p>Joel’s <code>data</code> directory contains three files: <code>fructose.dat</code>, <code>glucose.dat</code>, and <code>sucrose.dat</code>. Explain what a script called <code>example.sh</code> would do when run as <code>bash example.sh *.dat</code> if it contained the following lines:</p>
<pre><code># Script 1
echo *.*</code></pre>
<pre><code># Script 2
for filename in $1 $2 $3
do
    cat $filename
done</code></pre>
<pre><code># Script 3
echo $@.dat</code></pre>
</div>
</section>
        </div>
      </div>
      </article>
      <div class="footer">
        <a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
        <a class="label swc-blue-bg" href="https://github.com/swcarpentry/shell-novice">Source</a>
        <a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
        <a class="label swc-blue-bg" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
    <script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
    <script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    
      ga('create', 'UA-37305346-2', 'auto');
      ga('send', 'pageview');
    
    </script>
  </body>
</html>