Friday, November 25, 2016

Processing DSV data (Delimiter-Separated Values) with Python

By Vasudev Ram



I needed to process some DSV files recently. Here is a Python program I wrote for it, with a few changes over the original. E.g. I do not show the processing of the data here; I only read and print it. Also, I support two different command-line options (and ways) to specify the delimiter character.

DSV (Delimiter-separated values) is a common tabular text data format, with one record per line, and some number of fields per record, where the fields are separated or delimited by some specific character. Some common delimiter characters used in DSV files are tab (which makes them TSV files, Tab-Separated Values, common on Unix), comma (CSV files, Comma-Separated Values, a common spreadsheet and database import-export format), the pipe character (|), the colon (:) and others.

They - DSV files - are described in this section, Data File Metaformats, under Chapter 5: Textuality, of Eric Raymond (ESR)'s book, The Art of Unix Programming, which is a recommended read for anyone interested in Unix (one of the longest-lived operating systems [1]) and in software design and development.

[1] And, speaking a bit loosely, nowaday Unix is also the most widely used OS in the world, by a fair margin, due to its use (as a variant) in Android and iOS based mobile devices, both of which are Unix-based, not to mention Apple MacOS and Linux computers, which also are. Android devices alone number in the billions.

The program, read_dsv.py, is a command-line utility, written in Python, that allows you to specify the delimiter character in one of two ways:

- with a "-c delim_char" option, in which delim_char is an ASCII character,
- with a "-n delim_code" option, in which delim_code is an ASCII code.

It then reads either the files specified as command-line arguments after the -n or -c option, or if no files are given, it reads its standard input.

Here is the code for read_dsv.py:
from __future__ import print_function

"""
read_dsv.py
Author: Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
Purpose: Shows how to read DSV data, i.e. 

https://en.wikipedia.org/wiki/Delimiter-separated_values 

from either files or standard input, split the fields of each
line on the delimiter, and process the fields in some way.
The delimiter character is configurable by the user and can
be specified as either a character or its ASCII code.

Reference:
TAOUP (The Art Of Unix Programming): Data File Metaformats:
http://www.catb.org/esr/writings/taoup/html/ch05s02.html
ASCII table: http://www.asciitable.com/
"""

import sys
import string

def err_write(message):
    sys.stderr.write(message)

def error_exit(message):
    err_write(message)
    sys.exit(1)

def usage(argv, verbose=False):
    usage1 = \
        "{}: read and process DSV (Delimiter-Separated-Values) data.\n".format(argv[0])
    usage2 = "Usage: python" + \
        " {} [ -c delim_char | -n delim_code ] [ dsv_file ] ...\n".format(argv[0])
    usage3 = [
        "where one of either the -c or -n option must be given,\n",  
        "delim_char is a single ASCII delimiter character, and\n", 
        "delim_code is a delimiter character's ASCII code.\n", 
        "Text lines will be read from specified DSV file(s) or\n", 
        "from standard input, split on the specified delimiter\n", 
        "specified by either the -c or -n option, processed, and\n", 
        "written to standard output.\n", 
    ]
    err_write(usage1)
    err_write(usage2)
    if verbose:
        for line in usage3:
            err_write(line)

def str_to_int(s):
    try:
        return int(s)
    except ValueError as ve:
        error_exit(repr(ve))

def valid_delimiter(delim_code):
    return not invalid_delimiter(delim_code)

def invalid_delimiter(delim_code):
    # Non-ASCII codes not allowed, i.e. codes outside
    # the range 0 to 255.
    if delim_code < 0 or delim_code > 255:
        return True
    # Also, don't allow some specific ASCII codes;
    # add more, if it turns out they are needed.
    if delim_code in (10, 13):
        return True
    return False

def read_dsv(dsv_fil, delim_char):
    for idx, lin in enumerate(dsv_fil):
        fields = lin.split(delim_char)
        assert len(fields) > 0
        # Knock off the newline at the end of the last field,
        # since it is the line terminator, not part of the field.
        if fields[-1][-1] == '\n':
            fields[-1] = fields[-1][:-1]
        # Treat a blank line as a line with one field,
        # an empty string (that is what split returns).
        print("Line", idx, "fields:")
        for idx2, field in enumerate(fields):
            print(str(idx2) + ":", "|" + field + "|")

def main():
    # Get and check validity of arguments.
    sa = sys.argv
    lsa = len(sa)
    if lsa == 1:
        usage(sa)
        sys.exit(0)
    if lsa == 2:
        # Allow the help option with any letter case.
        if sa[1].lower() in ("-h", "--help"):
            usage(sa, verbose=True)
            sys.exit(0)
        else:
            usage(sa)
            sys.exit(0)

    # If we reach here, lsa is >= 3.
    # Check for valid mandatory options (sic).
    if not sa[1] in ("-c", "-n"):
        usage(sa, verbose=True)
        sys.exit(0)

    # If -c option given ...
    if sa[1] == "-c":
        # If next token is not a single character ...
        if len(sa[2]) != 1:
            error_exit(
            "{}: Error: -c option needs a single character after it.".format(sa[0]))
        if not sa[2] in string.printable:
            error_exit(
            "{}: Error: -c option needs a printable ASCII character after it.".format(\
            sa[0]))
        delim_char = sa[2]
    # else if -n option given ...
    elif sa[1] == "-n":
        delim_code = str_to_int(sa[2])
        if invalid_delimiter(delim_code):
            error_exit(
            "{}: Error: invalid delimiter code {} given for -n option.".format(\
            sa[0], delim_code))
        delim_char = chr(delim_code)
    else:
        # Checking for what should not happen ... a bit of defensive programming here.
        error_exit("{}: Program error: neither -c nor -n option given.".format(sa[0]))

    try:
        # If no filenames given, read sys.stdin ...
        if lsa == 3:
            print("processing sys.stdin")
            dsv_fil = sys.stdin
            read_dsv(dsv_fil, delim_char)
            dsv_fil.close()
        # else (filenames given), read them ...
        else:
            for dsv_filename in sa[3:]:
                print("processing file:", dsv_filename)
                dsv_fil = open(dsv_filename, 'r')
                read_dsv(dsv_fil, delim_char)
                dsv_fil.close()
    except IOError as ioe:
        error_exit("{}: Error: {}".format(sa[0], repr(ioe)))
        
if __name__ == '__main__':
    main()

Here are test runs of the program (both valid and invalid), and the results of each one:

Run it without any arguments. Gives a brief usage message.
$ python read_dsv.py
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -h option (for help). Gives the verbose usage message.
$ python read_dsv.py -h
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
where one of either the -c or -n option must be given,
delim_char is a single ASCII delimiter character, and
delim_code is a delimiter character's ASCII code.
Text lines will be read from specified DSV file(s) or
from standard input, split on the specified delimiter
specified by either the -c or -n option, processed, and
written to standard output.
Run it with a -v option (invalid run). Gives the brief usage message.
$ python read_dsv.py -v
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option but no ASCII character argument (invalid run). Gives the brief usage message.
$ python read_dsv.py -c
read_dsv.py: read and process DSV (Delimiter-Separated-Values) data.
Usage: python read_dsv.py [ -c delim_char | -n delim_code ] [ dsv_file ] ...
Run it with a -c option followed by the pipe character (invalid run). The OS (here, Windows) gives an error message because the pipe character cannot be used to end a pipeline.
$ python read_dsv.py -c |
The syntax of the command is incorrect.
Run it with the -c option and the pipe character as the delimiter character, but protected (by double quotes) from interpretation by the OS shell (CMD).
$ python read_dsv.py -c "|" file1.dsv
processing file: file1.dsv
Line 0 fields:
0: |1|
1: |2|
2: |3|
3: |4|
4: |5|
5: |6|
6: |7|
Line 1 fields:
0: |field1|
1: |fld2|
2: | fld3 with spaces around it |
3: |    fld4 with leading spaces|
4: |fld5 with trailing spaces     |
5: |next field is empty|
6: ||
7: |last field|
Line 2 fields:
0: ||
1: |1|
2: |22|
3: |333|
4: |4444|
5: |55555|
6: |666666|
7: |7777777|
8: |88888888|
Line 3 fields:
0: ||
1: ||
2: ||
3: ||
4: ||
5: |                      |
6: ||
7: ||
8: ||
9: ||
10: ||
Run it with the -n option followed by 124, the ASCII code for the pipe character as the delimiter.
$ python read_dsv.py -n 124 file1.dsv
[Gives exact same output as the run above, as it should, because both use the same delimiter and read the same input file.]
Copy file1.dsv to file3.dsv. Change all the pipe characters (delimiters) to colons:
Run it with the -n option followed by 58, the ASCII code for the colon character.
$ python read_dsv.py -n 58 file3.dsv
[Gives exact same output as the run above, as it should, because other than the delimiters (pipe versus colon), the input is the same.]

I added support for the -n option to the program because it makes it more flexible, since you can specify any ASCII character as the delimiter (that makes sense), by giving its ASCII code.

And of course, to find out the values of the ASCII codes for these delimiter characters, I used the char_to_ascii_code.py program from my recent post:

Trapping KeyboardInterrupt and EOFError for program cleanup

You may have noticed that I mentioned delimiter characters and DSV files in that post too. The char_to_ascii_code.py utility shown in that post was created to find the ASCII code for any character (without having to look it up on the web each time).

- Enjoy.

- Vasudev Ram - Online Python training and consulting

- Black Flyday at Flywheel Wordpress Managed Hosting - get 3 months free on the annual plan.

Get updates on my software products / ebooks / courses.

Jump to posts: Python   DLang   xtopdf

Subscribe to my blog by email

My ActiveState recipes



4 comments:

Anonymous said...

do you know about the csv module in the stdlib? https://docs.python.org/library/csv.html

Vasudev Ram said...

Thanks for your comment, but sure I do - from a long time. In fact I had used it (apart from other common uses) along with my xtopdf toolkit to convert CSV files to PDF. There is a
CSVReader.py file in xtopdf:

https://bitbucket.org/vasudevram/xtopdf

Was just showing another and more generic way of processing DSV files - with this DSV reading program - generic in the sense of being able to handle different delimiters, though not in the same file. However it has its limitations too - it will not handle embedded delimiters. And I'm aware that the csv module can (at least some), and can also handle other delimiters than comma. See CSVReader.py in xtopdf.

Vasudev Ram said...


BTW, the valid_delimiter() function, which is defined in terms of invalid_delimiter(), is not currently called in the program. I left it in there, because I was playing around with them both to see which one reads more naturally in calls. As of now it looks like invalid_delimiter() is more natural to use, because I tend to use the style shown, of checking for all errors up front (like Go program(mer)s often do), and exiting early on any error that cannot be corrected, before the processing logic starts for the main (non-error) case. IMO, this programming style tends to make for code that is clearer to understand, and you can be sure that if you thought of and caught all the invalid conditions and handled them, there can be no surprises below in your main processing code, which can then concentrate solely on handling the main (non-error) case.

Vasudev Ram said...


Typo in post: "nowaday Unix" should be "nowadays Unix".