Wednesday, May 10, 2017

Python utility like the Unix cut command - Part 1 - cut1.py

By Vasudev Ram

Regular readers of my blog might have noticed that I sometimes write (and write about) command line utilities in posts here.

Recently, I thought of implementing a utility like the Unix cut command in Python.

Here is my first cut at it (pun intended :), below.

However, instead of just posting the final code and some text describing it, this time, I had the idea of doing something different in a post (or a series of posts).

I thought it might be interesting to show some of the stages of development of the utility, such as incremental versions of it, with features or code improvements added in each successive version, and a bit of discussion on the design and implementation, and also on the thought processes occurring during the work.

(For beginners: incidentally, speaking of thought processes, during interviews for programming jobs at many companies, open-ended questions are often asked, where you have to talk about your thoughts as you work your way through to a solution to a problem posed. I know this because I've been on both sides of that many times. This interviewing technique helps the interviewers gauge your thought processes, and thereby, helps them decide whether they think you are good at design and problem-solving or not, which helps decide whether you get the job or not.)

One reason for doing this format of post, is just because it can be fun (for me to write about, and hopefully for others to read - I know I myself like to read such posts), and another is because one of my lines of business is training (on Python and other areas), and I've found that beginners sometimes have trouble going from a problem or exercise spec to a working implementation, even if they understand well the syntax of the language features needed to implement a solution. This can happen because 1) there can be many possible solutions to a given programming problem, and 2) the way to break down a problem into smaller pieces (stepwise refinement), that are more amenable to translating into programming statements and constructs, is not always self-evident to beginners. It is a skill one acquires over time, as one keeps on programming for months and years.

So I'm going to do it that way (with more explanation and multiple versions), over the course of a few posts.

For this first post, I'll just describe the rudimentary first version that I implemented, and show the code and the output of a few runs of the program.

The Wikipedia link about Unix cut near the top of this post describes its behavior and command line options.

In this first version, I only implement a subset of those:
- reading from a file (only a single file, and not reading from standard input (stdin))
- only support the -c (for cut by column) option (not the -b (by byte) or -f (by field) options)
- only support one column specification, i.e. -cm-n, not forms like -cm1-n1,m2-n2,...

In subsequent versions, I'll add support for some of the omitted features, and also fix any errors that I find in previous versions, by testing.

I'll call the versions cutN.py, where 1 <= N <= the highest version I implement. So this current post is about cut1.py.

Here is the code for cut1.py:
"""
File: cut1.py
Purpose: A Python tool somewhat similar to the Unix cut command.
Does not try to be exactly the same or implement all the features 
of Unix cut. Created for educational purposes.
Author: Vasudev Ram
Copyright 2017 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
"""

from __future__ import print_function

import sys
from error_exit import error_exit

def usage(args):
    #"Cuts the specified columns from each line of the input.\n", 
    lines = [
    "Print only the specified columns from lines of the input.\n", 
    "Usage: {} -cm-n file\n".format(args[0]), 
    "or: cmd_generating_text | {} -cm-n\n".format(args[0]), 
    "where -c means cut by column and\n", 
    "m-n means select (character) columns m to n\n", 
    "For each line, the selected columns will be\n", 
    "written to standard output. Columns start at 1.\n", 
    ]
    for line in lines:
        sys.stderr.write(line)

def cut(in_fil, start_col, end_col):
    for lin in in_fil:
        print(lin[start_col:end_col])

def main():

    sa, lsa = sys.argv, len(sys.argv)
    # Support only one input file for now.
    # Later extend to support stdin or more than one file.
    if lsa != 3:
        usage(sa)
        sys.exit(1)

    prog_name = sa[0]

    # If first two chars of first arg after script name are not "-c",
    # exit with error.
    if sa[1][:2] != "-c":
        usage(sa)
        error_exit("{}: Expected -c option".format(prog_name))

    # Get the part of first arg after the "-c".
    c_opt_arg = sa[1][2:]
    # Split that on "-".
    c_opt_flds = c_opt_arg.split("-")
    if len(c_opt_flds) != 2:
        error_exit("{}: Expected two field numbers after -c option, like m-n".format(prog_name))

    try:
        start_col = int(c_opt_flds[0])
        end_col = int(c_opt_flds[1])
    except ValueError as ve:
        error_exit("Conversion of either start_col or end_col to int failed".format(
        prog_name))

    if start_col < 1:
        error_exit("Error: start_col ({}) < 1".format(start_col))
    if end_col < 1:
        error_exit("Error: end_col ({}) < 1".format(end_col))
    if end_col < start_col:
        error_exit("Error: end_col < start_col")
    
    try:
        in_fil = open(sa[2], "r")
        cut(in_fil, start_col - 1, end_col)
        in_fil.close()
    except IOError as ioe:
        error_exit("Caught IOError: {}".format(repr(ioe)))

if __name__ == '__main__':
    main()
Here are the outputs of a few runs of cut1.py. I used this text file for the tests.
The line of digits at the top acts like a ruler :) which helps you know what character is at what column:
$ type cut-test-file-01.txt
12345678901234567890123456789012345678901234567890
this is a line with many words in it. how is it.
here is another line which also has many words.
now there is a third line that has some words.
can you believe it, a fourth line exists here.
$ python cut1.py
Print only the specified columns from lines of the input.
Usage: cut1.py -cm-n file
or: cmd_generating_text | cut1.py -cm-n
where -c means cut by column and
m-n means select (character) columns m to n
For each line, the selected columns will be
written to standard output. Columns start at 1.

$ python cut1.py -c
Print only the specified columns from lines of the input.
Usage: cut1.py -cm-n file
or: cmd_generating_text | cut1.py -cm-n
where -c means cut by column and
m-n means select (character) columns m to n
For each line, the selected columns will be
written to standard output. Columns start at 1.

$ python cut1.py -c a
cut1.py: Expected two field numbers after -c option, like m-n

$ python cut1.py -c0-0
Print only the specified columns from lines of the input.
Usage: cut1.py -cm-n file
or: cmd_generating_text | cut1.py -cm-n
where -c means cut by column and
m-n means select (character) columns m to n
For each line, the selected columns will be
written to standard output. Columns start at 1.

$ python cut1.py -c0-0 a
Error: start_col (0) < 1

$ python cut1.py -c1-0 a
Error: end_col (0) < 1

$ python cut1.py -c1-1 a
Caught IOError: IOError(2, 'No such file or directory')

$ python cut1.py -c1-1 cut-test-file-01.txt
1
t
h
n
c

$ python cut1.py -c6-12 cut-test-file-01.txt
6789012
is a li
is anot
here is
ou beli

$ python cut1.py -20-12 cut-test-file-01.txt
Print only the specified columns from lines of the input.
Usage: cut1.py -cm-n file
or: cmd_generating_text | cut1.py -cm-n
where -c means cut by column and
m-n means select (character) columns m to n
For each line, the selected columns will be
written to standard output. Columns start at 1.
cut1.py: Expected -c option

$ python cut1.py -c20-12 cut-test-file-01.txt
Error: end_col < start_col
- Vasudev Ram - Online Python training and consulting

Get updates (via Gumroad) on my forthcoming apps and content.

Jump to posts: Python * DLang * xtopdf

Subscribe to my blog by email

My ActiveState Code recipes

Follow me on: LinkedIn * Twitter

Are you a blogger with some traffic? Get Convertkit:

Email marketing for professional bloggers


No comments: