CSV files
CSV files are a text-based way of representing data. They are a series of values which are seperated by commas (,
)
with one record per newline. Today, we are going to write a basic CSV parser. There are cleaner approaches to
this, which you will get the opporunity to explore next study block.
This is what a sample CSV file looks like:
column1,column2,column3
a,100,500
b,200,100
c,600,100
In other words, it’s just plain text! There are many file formats like this, formats we can simply parse in a line-by-line basis. Text is a universal interface.
Create a CSV file on disk (eg, D drive) which we can process. You can make up the data but some columns should have numerical data, and some should have categorical (eg, a colour in words) data.
Reading text files
We’ve already seen the tools required to read files, back at the start of the module, Streams. We’re going to create a Stream Reader, which will allow us to read the file line-by-line.
// FIXME your CSV file probably isn't in /tmp/...
StreamReader stream =
new StreamReader("/tmp/test.csv");
string? line = stream.ReadLine();
while (line != null)
{
Console.WriteLine("Read a line: {0}", line);
line = stream.ReadLine();
}
stream.Close();
There are more elegant ways to do this (for example, File.ReadLines
or avoidng the duplication by doing
the assign in the while condition, but I want to keep this code reasonably easy to follow. You can explore
these approaches if you’d like.
Parsing the line
Each line of our input file contains one 'row' of data. We need to process this by splitting it by ,
and
then processing the array indicies as required (eg, parsing floats, ints, etc…). If you can’t remember
how to parse things to an integer or a float look back at the start of the module.
To convert the string into string array (eg, "A,B,C" to "A", "B", "C") can be done using, Split
:
string[] parts = line.Split(",");
For each iteration of the loop:
-
Split the string into parts,
-
then for each part convert it into a suitable type (eg, int, float, etc…)
-
print those variables to string using
Console.WriteLine
.
Using Classes to represent data
We can represent 'groups' of data using classes (structs would also work in this case).
A class can contain member functions and properties. Each instance of the class can have different values for their properties. Create a suitable class in your code to store each row of your CSV file:
class Row
{
public string foo;
public int bar;
public int baz;
}
Figure out what data types you need to represent your row. Create suitable properties in your class for these elements. You should now know what types are suitable based on your previous activity.
You can call your class something more meaningful if the data you are processing actually has a meaning.
Factory methods
We’re going to create a factory method which will contain the logic for how to construct a new Row.
We don’t want our method to belong to any particular row, but rather to the concept of a row itself (this
is actually why I introduced static
vs instance varaibles in this wording, because it makes this kind of
reasoning a bit clearer).
Our method will accept a string[] representation of the row as an argument, and we expect it to return an instance of the class.
class Row
{
public string foo;
public int bar;
public int baz;
public static Row Parse(string[] parts) {
Row row = new Row();
// TODO set the properties of row here
return row;
}
}
Tip
|
Methods vs Constructors
Some people might argue this should be a constructor rather than a static method. In my opinion they are wrong, but to be clear, this is my opinion on how to write clean code. There is no correct answer. My reasoning is as follows:
Others will have similar chains of reasoning that arise at different answers. This is fine. |
(advanced) overloading methods
You could also move the split into the function and accept a string rather than a string array. Actually, let’s make an overload set for doing that:
class Row
{
public string foo;
public int bar;
public int baz;
public static row Parse(string line) {
return Parse( line.split(",") );
}
public static row Parse(string[] parts) {
row row = new row();
// todo set the properties of row here
return row;
}
}
If we pass a string to Parse, the first version will be used - which will split the string and then call the second version. If we give the function an array directly, the second version will be used. People wanting to use the code now have the choice - they can use whichever version makes the most sense based on what they are doing.
Using our factory method
We can use our factory method in our loop like so:
string? line = stream.ReadLine();
while (line != null)
{
string[] parts = line.Split(",");
Row row = Row.Parse(parts);
Console.WriteLine("Read a line: {0} {1} {2}", row.foo, row.bar, row.baz);
line = stream.ReadLine();
}
Printing our lines is useful, but not overly so. Let’s make use of some of the data structures we’ve encounterd so far to make this more useful. We don’t know how many rows are in the CSV file, lets make use a a List.
Our list will store rows, so the type should be List<Row>
. We are leveraging generics to do this. The list
writer at Microsoft didn’t even have to know or care about the concept of a 'Row' when writing the List class
but we can now use them for our types. This is the power of generics/templates.
List<Row> rows = new List<Row>();
// then, in our loop:
rows.Add(row);
(extention) Adding some Enums
If you have categorical data, it makes more sense to store this as an enum
(enumeration, not chemicals with
a suitably assigned integer). Have a look at how you might Declare
an use an Enum for this data.
Refactoring
Hey wait, we’ve just written a little block of self-contained code, this is the concept of, "Read the CSV
file and dump into a List". We could extract this into its own method that takes a StreamReader
argument
and returns a List<Row>
. Do this.
Extract the parsing logic into its own function.
Processing our Data
Now we have succesfully extracted our data from the CSV file, we can start to process it to draw some insights.
Here are some functions that might be useful to write:
-
Calculate the lowest and highest values for a given numeric property (bar, baz)
-
Use the above to calculate the range (max - min) or a given property
-
Calculate the mean (sum / total) for a given numeric property (bar, baz)
-
Calculate the counts for each element of categorical data, and return it as Dictionary<string,int> (or enum to integer).
You’ll look more at data processing techniques with John next study block. For now, have a go at writing one or more of these methods.
'Virtual' Properties
Let’s say that we wanted to normalise some our values (make it between 0 and 1), not to be confused with Database Normalization. We could modify the data as its read in, but that means modifying the data (and we might want the non-normalised version for some reason).
Add this code (or similar) to your Row class (assuming bar is between 0 and 100) - you can use Feature Scaling if its not:
public float BarNormalized {
get {
return bar / 100.0F;
}
}
You can then use BarNormalized
as if it was a property rather than function:
Console.WriteLine("Bar: {0}, NormalisedBar: {1}", row.bar, row.BarNormalized);
Extention Tasks
Use what you have just learnt to process the following CSV file: Iris Dataset
Many of the Advent of Code tasks are solved by reading files and doing some kind of processing to them. Have a go at some of avent of code puzzles from the previous days.
Food for thought
Here is the equaivlent python program (from memory, I’ve not tried to run this so expect errors):
import csv
with open("test.csv") as f:
f_csv = csv.DictReader(f)
rows = []
for row in f_csv:
print( "{0} {1} {2}".format(row['foo'], row['bar'], row['baz']) )
rows.append( row )
max_value = max( [ int( row['bar'] ) for row in rows ] )
This also handles something which we don’t handle in our minimal CSV reader, a,"b,c",d"
should ignore the ,
in the quote marks, our code will split on it (creating 4 columns rather than 3).
This ease of expression for this kind of task is why I use Python for data processing tasks rather than CSharp. You will be learning to do this kind of thing in Python in COMP170. For balance, I wouldn’t use Python for large complex programs where static typing is useful, for that I’d use a higher-level language which are more suitable, such as C++ or Java.
Libraries like Pandas can make even more complicated expressions trival (grouping, filtering, partitioning, graphing, aggregation).