COVID-19 Ad hoc Analysis (4)｜ABLER｜NTT DATA REAL INTELLIGENCE with Agility

(4) Standardization and cleansing of code schemes

The last article was about the cleansing stage. Standardizing Johns Hopkins University Datasets that had changes upon their column names(JSON properties) as newer the datasets became on the NoSQL database. This article will again focus on the Data Cleansing of the data.

Cleansing 2：Standardizing the datetime format

Sampling values of the data on the database, notice there is a inconsistency with the item Last_Update. This item is a datetime data type, and we could see there were 3 ways the data was expressed in the data.

2020-03-27 22:14:55
2020-02-05T12:13:04
3/22/20 23:45

The first and the Second type is rather similar, but the "T" between the date and the time makes the two different.

This Last_Update column gives how up-to-date the data [numbers of infected, death, and recovered] data from COVID19 is. Inconsistent date forms are a problem, so for this circumstance we have Standardized the date format to YYYY-MM-DD. We can change the Last_Update value, but for this occasion, we added a JSON property Period .

For example, below is the DB data before cleansing.

{
  "Province_State": "", 
  "Country_Region": "Spain", 
  "Confirmed": "84", 
  "Deaths": "0", 
  "Recovered": "2", 
  "Lat": "40.0000", 
  "Long_": "-4.0000", 
  "Last_Update": "2020-03-01T23:33:03"
}

As below, we will process the data to add the column Period

{
  "Province_State": "", 
  "Country_Region": "Spain", 
  "Confirmed": "84", 
  "Deaths": "0", 
  "Recovered": "2", 
  "Lat": "40.0000", 
  "Long_": "-4.0000", 
  "Last_Update": "2020-03-01T23:33:03",
  "Period": "2020-03-01"
}

Like the last article, processing the job below will result in the standardized datetime data.

/**
** change the [Last_Update] date format to「YYYY-MM-DD」and add column「Period」
*/
let targetDir = "";
let fromProp ="";
let toProp = "";

// Change the variables here to use it with different datasets
targetDir = "/covid19/csse_covid_19_data/csse_covid_19_daily_reports/";
fromProp = "Last_Update";
toProp = "Period";

declareUpdate();
let myUris = cts.uris("", null, 
  cts.andQuery([
    cts.directoryQuery(targetDir)
  ])
);
let docObj;
let myVal;
let myPeriod;
for (myUri of myUris){
  docObj = cts.doc(myUri).toObject();
  if (fromProp in docObj){
    myVal = docObj[fromProp];
    let myDate = new Date(myVal);
    myPeriod = getYMD(myDate);
    docObj[toProp] = myPeriod;
    xdmp.documentInsert(myUri, docObj);
  }
}

function getYMD(myDate){
  var dt = myDate;
  var y = dt.getFullYear();
  var m = ("00" + (dt.getMonth()+1)).slice(-2);
  var d = ("00" + dt.getDate()).slice(-2);
  var result = y + "-" + m + "-" + d;
  return result;
}

Cleansing 3：Standardization of Country, and Region

Next, lets look at data inside the database for the Country_Region column which defines the countries which hold the [infected, death, and recovered] data.

/**
** Delete redundant values from target column
*/
let targetDir = "";
let targetProp ="";
let propValueList = [];
let myVal;

targetDir = "/covid19/csse_covid_19_data/csse_covid_19_daily_reports/";
targetProp = "Country_Region";

let myUris = cts.uris("", null, 
  cts.andQuery([
    cts.directoryQuery(targetDir)
  ])
);

for (myUri of myUris){
  docObj = cts.doc(myUri).toObject();
  myVal = docObj[targetProp];
  
  if (propValueList.indexOf(myVal) == -1){
    propValueList.push(myVal);
  }
}

propValueList;

Looking at the [Country/Region] column value, notice some problems below.

Inconsistencies in the [Country/Region] format

First, examples of having the same country/region but different formats of writing.

Russia, and Russian Federation
United Kingdom, and UK
Cabo Verde, and Cape Verde
Viet Nam, and Vietnam
Guernsey, and Channel Islands

also, there are Different formats for Special Administrative Region for China.

Hong Kong, and Hong Kong SAR
Macao, and Macao SAR

There were inconsistencies between the formats for having the word "Republic" in or not.

Ireland, and Republic of Ireland
Czechia, and Czech Republic
Moldova, and Republic of Moldova
Iran, and Iran (Islamic Republic of)
Korea, South, Republic of Korea, and South Korea

Inconsistencies between the formats for having the word "The" in or not, is also noticed.

Bahamas, Bahamas, The, and The Bahamas
Gambia, Gambia, The, and The Gambia

One of the major characteristics of this dataset is that there is cruise ship names shown in the country/region section.

Cruise Ship
Diamond Princess
MS Zaandam

There are columns that showed the "other" for unknown purposes as well.

Others

Setting the Country/Region code schemes

Since the Johns Hopkins University Datasets have inconsistencies within their data formats, it is important to consider organizing the code scheme of the data, especially thinking about when combining the information with other datasets.

For the Country/Region code schemes, we chose the ISO3166-1 format. And for the process, we have decided to add a new CountryCode column rather than changing the Country_Region.

Notice there is the new value in the CountryCode column below.

{
  "Province_State": "", 
  "Country_Region": "Spain", 
  "Confirmed": "84", 
  "Deaths": "0", 
  "Recovered": "2", 
  "Lat": "40.0000", 
  "Long_": "-4.0000", 
  "Last_Update": "2020-03-01T23:33:03",
  "Period": "2020-03-01",
  "CountryCode": "ES"
}

The data above is acquired by using the process below within the NoSQL database.

let targetDir = "";
let fromProp ="";
let toProp = "";

targetDir = "/covid19/csse_covid_19_data/csse_covid_19_daily_reports/";
fromProp = "Country_Region";
toProp = "CountryCode";

declareUpdate();
let myUris = cts.uris("", null, 
  cts.andQuery([
    cts.directoryQuery(targetDir)
  ])
);
let docObj;
let myVal;
let myContryCode;
for (myUri of myUris){
  docObj = cts.doc(myUri).toObject();
  if (fromProp in docObj){
    myVal = docObj[fromProp];
    myContryCode = getCountryCode(myVal);
    docObj[toProp] = myContryCode;
    xdmp.documentInsert(myUri, docObj);
  }
}

function getCountryCode(myVal){
  let myHenkan = [
    {keyword:"Afghanistan", code : "AF"},
    {keyword:"Albania", code : "AL"},
    {keyword:"Algeria", code : "DZ"},
      :
    {keyword:"Zimbabwe", code : "ZW"}
  ];
  
  try{
    var newLine = myHenkan.filter(function(item, index){
      if (item.keyword == myVal) return true;
    });
    return newLine[0].code;
  } catch(e){
    return "";
  }
}

In the real code, lines 30~34 which represents the replacement words of each Country/Region is much longer. This way, Russia and Russian Federation will be replaced by RU. Columns that are not stated in the ISO format, will be stated in the CountryCode as the value of the Country_Region, such as the Diamond Princess.

Summary: Adding Newer columns as we go

Like the last article, we have focused on the schema-less database approach of “Storing, then cleansing data in steps” explaining the flexible nature of the database.

Especially this time, there are new columns added as new datasets come in. It is like adding columns to a table when using RDB as an example. Doing this without the system stopping, is a advantage of a schema-less, NoSQL database.

standardizing the“country/region” item by applying the existing ISO code scheme. This also made combining other datasets with ease since most datasets are using the ISO code scheme.

The next article will focus on how to create a line graph of the number of infected people(COVID-19) per country using the datasets we have been preparing in the last 3 articles.