(3) Data Cleansing of Ingested Data
The last article was about ingesting the Johns Hopkins University datasets into a NoSQL database. This article focuses on the data preparation stage.
Finding out the details of the existing columns
As mentioned in the last article, the Johns Hopkins datasets change both the number of columns and the column names over time. Which columns remain once the redundant ones in each file are accounted for? The Data Discovery solution provides functions to explore the data inside the NoSQL database and reverse engineer the schema of the stored data.
When the JHU data is loaded, Data Discovery also analyzes it and derives the schema information listed below. The column names appear in lexical order; a "9" stands for a numeric data type and an "A" for a character data type.
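To illustrate the idea behind this kind of schema discovery, a server-side JavaScript query along the following lines could enumerate the JSON property names and infer a simple type code for each. This is only a sketch: the collection name 'jhu-daily-reports' is an assumption about how the rows were ingested, and Data Discovery performs this analysis out of the box.

```javascript
'use strict';

// Rough sketch of the schema discovery described above.
// Assumption: each CSV row was ingested as a JSON document into a
// hypothetical collection named 'jhu-daily-reports'.
const columnTypes = {};

for (const uri of cts.uris('', ['limit=1000'], cts.collectionQuery('jhu-daily-reports'))) {
  const row = cts.doc(uri).toObject();
  for (const name of Object.keys(row)) {
    // "9" = numeric column, "A" = character column, matching the listing above.
    // If numeric values were ingested as strings, a pattern check would be needed instead.
    columnTypes[name] = (typeof row[name] === 'number') ? '9' : 'A';
  }
}

// Emit the property names in lexical order with their inferred type codes
Object.keys(columnTypes).sort().map(name => name + ': ' + columnTypes[name]);
```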
The last article noted, "Comparing the Jan 22 data, which has 6 columns, to the March 22 data, which has 12 columns, there are differences in the number of columns." Since then, further columns have been added, and we are now dealing with 12 to 15 columns.
The changes made to the JHU datasets are organized below.
- Started with 6 columns (January 22, 2020)
- Columns Latitude and Longitude added (March 1, 2020)
- On March 22, multiple columns (Active, FIPS, Admin2, Combined_Key) were added, and some columns were renamed: Last Update became Last_Update, Latitude became Lat, and Longitude became Long_
The facts above mean that comparing the first dataset provided in January with the most recent one will not reveal everything needed to understand all the columns. Checking every part of the data is a lot of work, so the efficient way is to use a tool (in this case, Data Discovery) to ease the process of looking through all of it.
Cleansing 1: Standardizing the Column names
First, the columns that were renamed (Last Update to Last_Update, Latitude to Lat, and Longitude to Long_) must be updated in the database.
The March 1st data in the database looks as follows.
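For illustration, a March 1st record stored as JSON might look like this (the values are placeholders; only the property names reflect the dataset):

```json
{
  "Province/State": "Hubei",
  "Country/Region": "Mainland China",
  "Last Update": "2020-03-01T10:13:19",
  "Confirmed": 100,
  "Deaths": 1,
  "Recovered": 10,
  "Latitude": 30.97,
  "Longitude": 112.27
}
```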
Change this to fit the shape of the March 22nd JSON property data (that is, Last Update becomes Last_Update, Latitude becomes Lat, and Longitude becomes Long_).
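After standardization, the same illustrative record takes the March 22nd shape:

```json
{
  "Province/State": "Hubei",
  "Country/Region": "Mainland China",
  "Last_Update": "2020-03-01T10:13:19",
  "Confirmed": 100,
  "Deaths": 1,
  "Recovered": 10,
  "Lat": 30.97,
  "Long_": 112.27
}
```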
The NoSQL database we are using, MarkLogic, can execute JavaScript inside the DBMS and update the data in place. Below is the process used to standardize the column names.
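As an illustration of that process, a server-side JavaScript script along these lines could perform the renaming. It is a sketch rather than the exact script used here: the collection name 'jhu-daily-reports' is an assumption about how the data was loaded.

```javascript
'use strict';
declareUpdate();

// Sketch of the renaming step, assuming the daily report rows live as JSON
// documents in a hypothetical collection named 'jhu-daily-reports'.
const renames = {
  'Last Update': 'Last_Update',
  'Latitude': 'Lat',
  'Longitude': 'Long_'
};

for (const uri of cts.uris('', [], cts.collectionQuery('jhu-daily-reports'))) {
  const row = cts.doc(uri).toObject();
  let changed = false;

  for (const oldName of Object.keys(renames)) {
    if (oldName in row) {
      row[renames[oldName]] = row[oldName]; // copy the value under the new name
      delete row[oldName];                  // drop the old property
      changed = true;
    }
  }

  if (changed) {
    // Overwrite the document with the standardized property names.
    // A production script would also preserve collections/permissions and
    // batch the updates to keep transactions small.
    xdmp.documentInsert(uri, row);
  }
}
```

Because only the property names change, documents that already use the March 22nd names pass through untouched, and newer columns such as Active or FIPS are simply left as they are.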
Summary: The "store, then process" method of a schema-less DB
When using the traditional relational database approach, the process below would be needed:
- Explore the headers and columns of every CSV file, from January 22nd to the most recent data
- Define the tables (CREATE TABLE) while anticipating the column changes and additions that might occur
- Develop the ETL process, again anticipating the column changes and additions that might occur
- Ingest the CSV files into the database
Compared to the above, the process changes when using a NoSQL database: ingestion comes first, and only after the CSV files are stored do the schema analysis and data processing take place.
- First, store all the CSV files from January 22nd to the most recent data
- Within the database, explore the JSON property data (using Data Discovery)
- Process only the columns that changed
- New columns are automatically added to the data
This demonstrates the flexibility of schema-less databases.
Next, we will go further into the data cleansing process.