Dealing with Data

Writing Data from file

We’re continuing from last week, so if you’ve not managed to parse a CSV file yet, try parsing the CSV data. If you are stuck, my solution is available here.

Writing data to a file

As we saw last week, CSV files are (mostly) comma-separated values where each record is on a new line. Most spreadsheet and data-processing applications can natively understand this file format. We do need to be careful if the data itself can contain commas.

One common way to write names is to put the surname, a comma, then the first name. For example, WALTON-RIVERS, Joseph. This will result in the names being sorted by surname correctly if sorted lexicographically (i.e., by ASCII value). However, this would pose a problem for our CSV parser, which would split the name into two fields at the comma.

There are a few ways round this:

  1. Use a CSV parsing library that can deal with this (by processing quotes and escaped commas (\,))

  2. Use a different separator (e.g., use tabs rather than commas for separation) - this is often also supported by spreadsheet programs

  3. Replace the commas when you save the file (WALTON-RIVERS, Joseph becomes WALTON-RIVERS: Joseph), then replace the value back when loading the CSV

Option 3 is probably the 'most hacky' of these, because you can no longer tell whether someone genuinely has that special character in their name. However, in practice I have seen this used as a way of getting something 'done quickly'. Now you must encounter such pain as well.
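To make option 3 concrete, here is a minimal sketch of the round trip. The class and method names (CommaHack, Hide, Restore) are made up for illustration, and I am assuming ':' is the stand-in character, as in the example above.

```csharp
using System;

public static class CommaHack
{
    // Assumption: ':' is the stand-in character, as in WALTON-RIVERS: Joseph.
    public const char StandIn = ':';

    // Call on each field before it is written to the CSV file.
    public static string Hide(string field)
    {
        return field.Replace(',', StandIn);
    }

    // Call on each field after the line has been split on commas.
    public static string Restore(string field)
    {
        return field.Replace(StandIn, ',');
    }
}
```

CommaHack.Hide("WALTON-RIVERS, Joseph") gives "WALTON-RIVERS: Joseph", and Restore turns it back. The pain shows up with a value like "10:30", which never had a comma in it: Restore turns it into "10,30".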

The Christmas-Themed List

We’re going to process a Christmas-themed list of names. I have prepared this using the 'Faker' library. You can download it here, and the script to generate this dataset is also in the same place.

This fake list of data is meant to represent a list of people who are on the naughty/nice list. We will be doing some data cleanup and writing a new list.

  • The name column is using the 'hacky' approach described above. Split this into two columns: last_name and first_name
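One possible way to do the split is sketched below. NameSplitter is a hypothetical helper, and I am assuming the stand-in character in the file is ':' (check the dataset and adjust if it is something else).

```csharp
using System;

public static class NameSplitter
{
    // Splits "SURNAME: First" into its two parts.
    // Assumption: the dataset uses ':' where the comma would have been.
    public static (string LastName, string FirstName) Split(string name)
    {
        // A limit of 2 means any later ':' stays inside the first name.
        string[] parts = name.Split(new[] { ':' }, 2);
        string last = parts[0].Trim();
        // If there is no separator at all, leave the first name empty.
        string first = parts.Length > 1 ? parts[1].Trim() : "";
        return (last, first);
    }
}
```

For example, NameSplitter.Split("WALTON-RIVERS: Joseph") returns the pair ("WALTON-RIVERS", "Joseph"). The Trim calls remove the space that follows the separator.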

Note

For real-world data, we probably shouldn’t be making assumptions about name structures like this. There are many assumptions we make about names which are often violated by real-world data. Arguably, we should just treat them as an opaque identifier (although, we’d usually also need some form of 'sort-order' column in that case).

This dataset is synthetic - I’m trying to put a Christmas-themed tint on a data processing task and I’m not that imaginative. Wait till you read the next bit and see what I use for the 'naughty' and 'nice' lists.

Naughty and Nice Lists

  • Add a new column to the data. We’ll call this 'status'

  • There are two columns to represent attendance data:

    • sessions is how many sessions the person was scheduled to be in

    • attended is how many they were recorded as attending

  • We’ll be processing this to generate our naughty list:

    • If the rate of attendance (attended/sessions*100) is even, then 'status' should be 'naughty'

    • If the rate of attendance is odd, then 'status' should be 'oney' (or nice)

Yes, that is a binary pun. What did you expect from COMP101?
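The rule above could be sketched like this. NaughtyOrNice is a hypothetical helper, and I am assuming the rate is a whole-number percentage rounded down, since 'even' and 'odd' only make sense for integers and the rounding is not specified.

```csharp
using System;

public static class NaughtyOrNice
{
    // Whole-number attendance percentage.
    // Assumption: the rate is rounded down; the rounding rule is not specified.
    public static int AttendanceRate(int attended, int sessions)
    {
        if (sessions <= 0)
        {
            throw new ArgumentException("sessions must be positive", nameof(sessions));
        }
        // Multiply first: with ints, attended / sessions * 100 gives 0
        // whenever attended is less than sessions.
        return attended * 100 / sessions;
    }

    // Even rate is 'naughty', odd rate is 'nice'.
    public static string Status(int attended, int sessions)
    {
        int rate = AttendanceRate(attended, sessions);
        return rate % 2 == 0 ? "naughty" : "nice";
    }
}
```

With these assumptions, 3 out of 4 sessions is a rate of 75, which is odd, so the status is 'nice'. Full attendance is 100, which is even, so it lands on the naughty list. I did say the dataset was synthetic.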

Output your modified CSV file. You can use a StreamWriter to output lines of text, and String formatting to generate a string with a suitable format. Don’t forget to output the header row!

Here is an example:

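writer.cs
using System.IO;

StreamWriter sw = new StreamWriter("helloWorld.csv");
// C# format placeholders are numbered ({0}, {1}); a bare {} throws a FormatException.
sw.WriteLine( String.Format("{0},{1}", "Hello", "world!") );
sw.Close(); // flushes the buffered text and closes the file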
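Building on that, here is a rough sketch of writing a header row followed by the data rows. CsvOutput is a hypothetical helper, and the column order in the header is my assumption: match it to the columns you actually keep.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvOutput
{
    // Writes a header row followed by one line per record.
    // Takes a TextWriter so it works with a StreamWriter (a file)
    // or a StringWriter (handy for checking the output in memory).
    public static void Write(TextWriter writer, IEnumerable<string[]> rows)
    {
        // Assumption: these are the columns being kept, in this order.
        writer.WriteLine("last_name,first_name,sessions,attended,status");
        foreach (string[] row in rows)
        {
            // string.Join saves counting format placeholders by hand.
            writer.WriteLine(string.Join(",", row));
        }
    }
}
```

To write to a file, create a StreamWriter inside a using block and pass it in, so the file is closed even if something throws part way through.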

Generating Text

We can use the same idea to generate text. HTML is just text with extra steps. Markdown (and asciidoc) is text with strange rules and slightly fewer steps. LaTeX is text-processing with stranger rules and more steps, but at least the math renders pretty.

Using the string processing techniques we just covered, we can generate a Markdown list of naughty and nice people. This is the same idea as the CSV file approach, but now we need to skip the other group when outputting.

There are a few different ways you could do this. Experiment to see what you can come up with.

Santa.md
# Naughty

* name 1
* name 2
* name 3

# Nice (Oney)

* name 1
* name 2
* name 3
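If you get stuck, one of the possible approaches is sketched here: loop over everyone once per heading and skip anyone whose status does not match. MarkdownList and its parameters are hypothetical names, not something you have to copy.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MarkdownList
{
    // Builds one heading plus a bullet for every person with the wanted status.
    public static string Section(
        string heading,
        string wanted,
        IEnumerable<(string Name, string Status)> people)
    {
        var sb = new StringBuilder();
        sb.AppendLine("# " + heading);
        sb.AppendLine();
        foreach (var person in people)
        {
            if (person.Status != wanted)
            {
                continue; // skip the other group
            }
            sb.AppendLine("* " + person.Name);
        }
        return sb.ToString();
    }
}
```

Calling Section once with "naughty" and once with "nice", then writing both strings with a StreamWriter, would give a file shaped like Santa.md. Walking the list twice is simple rather than fast, which is fine for a list this size.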

HTML

HTML is a similar idea, but the output is more verbose. I write these lab scripts in a text-based markup language (asciidoc), then a program converts them into HTML (asciidoctor-web-pdf), and then a static site generator (hugo) dumps them into the theme.

In other words, the whole stack for generating these scripts is plain text (I even use a text editor, Vim, to write these). Plain-text formats have a lot of power and play very well with git, but we sometimes need to output HTML directly.

Here is a similar list, formatted as HTML:

Santa.html
<!doctype html>
<html>
  <head>
    <title>The List</title>
  </head>
  <body>
    <h1>List</h1>

    <h2>Naughty</h2>

    <ul>
      <li>name 1</li>
      <li>name 2</li>
      <li>name 3</li>
    </ul>

    <h2>Nice (Oney)</h2>

    <ul>
      <li>name 1</li>
      <li>name 2</li>
      <li>name 3</li>
    </ul>
  </body>
</html>

Add a new function to your code that can output a document like this. You’ll be looking more at this next study block.
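As a starting point, a function like that might look roughly as follows. HtmlList is a hypothetical helper. WebUtility.HtmlEncode comes from System.Net, and I am using it on the assumption that a name could contain characters such as < or &.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class HtmlList
{
    // Builds an <h2> heading followed by a <ul> of names.
    public static string Section(string heading, IEnumerable<string> names)
    {
        var sb = new StringBuilder();
        sb.AppendLine("    <h2>" + WebUtility.HtmlEncode(heading) + "</h2>");
        sb.AppendLine("    <ul>");
        foreach (string name in names)
        {
            // Encode so characters such as < or & in a name cannot break the page.
            sb.AppendLine("      <li>" + WebUtility.HtmlEncode(name) + "</li>");
        }
        sb.AppendLine("    </ul>");
        return sb.ToString();
    }

    // Wraps the two sections in the same page skeleton as Santa.html.
    public static string Page(IEnumerable<string> naughty, IEnumerable<string> nice)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!doctype html>");
        sb.AppendLine("<html>");
        sb.AppendLine("  <head>");
        sb.AppendLine("    <title>The List</title>");
        sb.AppendLine("  </head>");
        sb.AppendLine("  <body>");
        sb.AppendLine("    <h1>List</h1>");
        sb.Append(Section("Naughty", naughty));
        sb.Append(Section("Nice (Oney)", nice));
        sb.AppendLine("  </body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }
}
```

The string returned by Page can be written out with a StreamWriter in the same way as the CSV file. Keeping Section separate means the same code produces both lists.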

With the remaining time:

  1. Work on the remaining worksheets

  2. Add CSS styling to your lists
