Column

COVID-19 Ad hoc Analysis (4)

June 19 2020
Taiga Nielsen (Taiga Nielsen@NTT DATA)
Jun Shiromizu (Jun Shiromizu@NTT DATA)

This Article, using the COVID-19 data as an example, explains the Data aggregation Methodology for ingesting variety of datasets and Ad hoc Analysis.

(4) Standardization and cleansing of code schemes

The last article was about the cleansing stage. Standardizing Johns Hopkins University Datasets that had changes upon their column names(JSON properties) as newer the datasets became on the NoSQL database. This article will again focus on the Data Cleansing of the data.

Cleansing 2:Standardizing the datetime format

Sampling values of the data on the database, notice there is a inconsistency with the item Last_Update. This item is a datetime data type, and we could see there were 3 ways the data was expressed in the data.

  • 2020-03-27 22:14:55
  • 2020-02-05T12:13:04
  • 3/22/20 23:45

The first and the Second type is rather similar, but the "T" between the date and the time makes the two different.

This Last_Update column gives how up-to-date the data [numbers of infected, death, and recovered] data from COVID19 is. Inconsistent date forms are a problem, so for this circumstance we have Standardized the date format to YYYY-MM-DD. We can change the Last_Update value, but for this occasion, we added a JSON property Period .

For example, below is the DB data before cleansing.

As below, we will process the data to add the column Period

Like the last article, processing the job below will result in the standardized datetime data.

Cleansing 3:Standardization of Country, and Region

Next, lets look at data inside the database for the Country_Region column which defines the countries which hold the [infected, death, and recovered] data.

Looking at the [Country/Region] column value, notice some problems below.

Inconsistencies in the [Country/Region] format

First, examples of having the same country/region but different formats of writing.

  • Russia, and Russian Federation
  • United Kingdom, and UK
  • Cabo Verde, and Cape Verde
  • Viet Nam, and Vietnam
  • Guernsey, and Channel Islands

also, there are Different formats for Special Administrative Region for China.

  • Hong Kong, and Hong Kong SAR
  • Macao, and Macao SAR

There were inconsistencies between the formats for having the word "Republic" in or not.

  • Ireland, and Republic of Ireland
  • Czechia, and Czech Republic
  • Moldova, and Republic of Moldova
  • Iran, and Iran (Islamic Republic of)
  • Korea, South, Republic of Korea, and South Korea

Inconsistencies between the formats for having the word "The" in or not, is also noticed.

  • Bahamas, Bahamas, The, and The Bahamas
  • Gambia, Gambia, The, and The Gambia

One of the major characteristics of this dataset is that there is cruise ship names shown in the country/region section.

  • Cruise Ship
  • Diamond Princess
  • MS Zaandam

There are columns that showed the "other" for unknown purposes as well.

  • Others

Setting the Country/Region code schemes

Since the Johns Hopkins University Datasets have inconsistencies within their data formats, it is important to consider organizing the code scheme of the data, especially thinking about when combining the information with other datasets.

For the Country/Region code schemes, we chose the ISO3166-1 format. And for the process, we have decided to add a new CountryCode column rather than changing the Country_Region.

Notice there is the new value in the CountryCode column below.

The data above is acquired by using the process below within the NoSQL database.

In the real code, lines 30~34 which represents the replacement words of each Country/Region is much longer. This way, Russia and Russian Federation will be replaced by RU. Columns that are not stated in the ISO format, will be stated in the CountryCode as the value of the Country_Region, such as the Diamond Princess.

Summary: Adding Newer columns as we go

Like the last article, we have focused on the schema-less database approach of “Storing, then cleansing data in steps” explaining the flexible nature of the database.

Especially this time, there are new columns added as new datasets come in. It is like adding columns to a table when using RDB as an example. Doing this without the system stopping, is a advantage of a schema-less, NoSQL database.

standardizing the“country/region” item by applying the existing ISO code scheme. This also made combining other datasets with ease since most datasets are using the ISO code scheme.

The next article will focus on how to create a line graph of the number of infected people(COVID-19) per country using the datasets we have been preparing in the last 3 articles.

Contact

NTTDATA Corporation
Business Development Promotion Office,
Fourth Financial Sector,