Processing Data

CSV files

CSV files are a text-based way of representing data. They are a series of values which are seperated by commas (,) with one record per newline. Today, we are going to write a basic CSV parser. There are cleaner approaches to this, which you will get the opporunity to explore next study block.

This is what a sample CSV file looks like:

example.csv

column1,column2,column3
a,100,500
b,200,100
c,600,100

In other words, it’s just plain text! There are many file formats like this, formats we can simply parse in a line-by-line basis. Text is a universal interface.

Create a CSV file on disk (eg, D drive) which we can process. You can make up the data but some columns should have numerical data, and some should have categorical (eg, a colour in words) data.

Reading text files

We’ve already seen the tools required to read files, back at the start of the module, Streams. We’re going to create a Stream Reader, which will allow us to read the file line-by-line.

  // FIXME your CSV file probably isn't in /tmp/...
  StreamReader stream =
                new StreamReader("/tmp/test.csv");

  string? line = stream.ReadLine();
  while (line != null)
  {
    Console.WriteLine("Read a line: {0}", line);
    line = stream.ReadLine();
  }

    stream.Close();

There are more elegant ways to do this (for example, File.ReadLines or avoidng the duplication by doing the assign in the while condition, but I want to keep this code reasonably easy to follow. You can explore these approaches if you’d like.

Parsing the line

Each line of our input file contains one 'row' of data. We need to process this by splitting it by , and then processing the array indicies as required (eg, parsing floats, ints, etc…). If you can’t remember how to parse things to an integer or a float look back at the start of the module.

To convert the string into string array (eg, "A,B,C" to "A", "B", "C") can be done using, Split:

string[] parts = line.Split(",");

For each iteration of the loop:

Split the string into parts,
then for each part convert it into a suitable type (eg, int, float, etc…)
print those variables to string using Console.WriteLine.

Using Classes to represent data

We can represent 'groups' of data using classes (structs would also work in this case).

A class can contain member functions and properties. Each instance of the class can have different values for their properties. Create a suitable class in your code to store each row of your CSV file:

class Row
{
    public string foo;
    public int bar;
    public int baz;
}

Figure out what data types you need to represent your row. Create suitable properties in your class for these elements. You should now know what types are suitable based on your previous activity.

You can call your class something more meaningful if the data you are processing actually has a meaning.

Factory methods

We’re going to create a factory method which will contain the logic for how to construct a new Row.

We don’t want our method to belong to any particular row, but rather to the concept of a row itself (this is actually why I introduced static vs instance varaibles in this wording, because it makes this kind of reasoning a bit clearer).

Our method will accept a string[] representation of the row as an argument, and we expect it to return an instance of the class.

class Row
{
    public string foo;
    public int bar;
    public int baz;

    public static Row Parse(string[] parts) {
        Row row = new Row();
        // TODO set the properties of row here

        return row;
    }
}

Tip

Methods vs Constructors

Some people might argue this should be a constructor rather than a static method. In my opinion they are wrong, but to be clear, this is my opinion on how to write clean code. There is no correct answer. My reasoning is as follows:

Contructors are for constructing objects
Parsing rows is not a construction activity, it’s a buiness logic (ie, we’re not allocating memory or similar)
Therefore, it’s not construction-related logic so belongs in a static method

Others will have similar chains of reasoning that arise at different answers. This is fine.

(advanced) overloading methods

You could also move the split into the function and accept a string rather than a string array. Actually, let’s make an overload set for doing that:

class Row
{
    public string foo;
    public int bar;
    public int baz;

    public static row Parse(string line) {
        return Parse( line.split(",") );
    }

    public static row Parse(string[] parts) {
        row row = new row();
        // todo set the properties of row here

        return row;
    }
}

If we pass a string to Parse, the first version will be used - which will split the string and then call the second version. If we give the function an array directly, the second version will be used. People wanting to use the code now have the choice - they can use whichever version makes the most sense based on what they are doing.

Using our factory method

We can use our factory method in our loop like so:

            string? line = stream.ReadLine();
            while (line != null)
            {
                string[] parts = line.Split(",");
                Row row = Row.Parse(parts);

                Console.WriteLine("Read a line: {0} {1} {2}", row.foo, row.bar, row.baz);
                line = stream.ReadLine();
            }

Printing our lines is useful, but not overly so. Let’s make use of some of the data structures we’ve encounterd so far to make this more useful. We don’t know how many rows are in the CSV file, lets make use a a List.

Our list will store rows, so the type should be List<Row>. We are leveraging generics to do this. The list writer at Microsoft didn’t even have to know or care about the concept of a 'Row' when writing the List class but we can now use them for our types. This is the power of generics/templates.

List<Row> rows = new List<Row>();

// then, in our loop:
rows.Add(row);

(extention) Adding some Enums

If you have categorical data, it makes more sense to store this as an enum (enumeration, not chemicals with a suitably assigned integer). Have a look at how you might Declare an use an Enum for this data.

Refactoring

Hey wait, we’ve just written a little block of self-contained code, this is the concept of, "Read the CSV file and dump into a List". We could extract this into its own method that takes a StreamReader argument and returns a List<Row>. Do this.

Extract the parsing logic into its own function.

Processing our Data

Now we have succesfully extracted our data from the CSV file, we can start to process it to draw some insights.

Here are some functions that might be useful to write:

Calculate the lowest and highest values for a given numeric property (bar, baz)
Use the above to calculate the range (max - min) or a given property
Calculate the mean (sum / total) for a given numeric property (bar, baz)
Calculate the counts for each element of categorical data, and return it as Dictionary<string,int> (or enum to integer).

You’ll look more at data processing techniques with John next study block. For now, have a go at writing one or more of these methods.

'Virtual' Properties

Let’s say that we wanted to normalise some our values (make it between 0 and 1), not to be confused with Database Normalization. We could modify the data as its read in, but that means modifying the data (and we might want the non-normalised version for some reason).

Add this code (or similar) to your Row class (assuming bar is between 0 and 100) - you can use Feature Scaling if its not:

public float BarNormalized {
    get {
        return bar / 100.0F;
    }
}

You can then use BarNormalized as if it was a property rather than function:

Console.WriteLine("Bar: {0}, NormalisedBar: {1}", row.bar, row.BarNormalized);

Extention Tasks

Use what you have just learnt to process the following CSV file: Iris Dataset

Many of the Advent of Code tasks are solved by reading files and doing some kind of processing to them. Have a go at some of avent of code puzzles from the previous days.

Food for thought

Here is the equaivlent python program (from memory, I’ve not tried to run this so expect errors):

import csv

with open("test.csv") as f:
    f_csv = csv.DictReader(f)

    rows = []
    for row in f_csv:
        print( "{0} {1} {2}".format(row['foo'], row['bar'], row['baz']) )
        rows.append( row )

    max_value = max( [ int( row['bar'] ) for row in rows ] )

This also handles something which we don’t handle in our minimal CSV reader, a,"b,c",d" should ignore the , in the quote marks, our code will split on it (creating 4 columns rather than 3).

This ease of expression for this kind of task is why I use Python for data processing tasks rather than CSharp. You will be learning to do this kind of thing in Python in COMP170. For balance, I wouldn’t use Python for large complex programs where static typing is useful, for that I’d use a higher-level language which are more suitable, such as C++ or Java.

Libraries like Pandas can make even more complicated expressions trival (grouping, filtering, partitioning, graphing, aggregation).