(4) Standardization and cleansing of code schemes
The last article was about the cleansing stage. Standardizing Johns Hopkins University Datasets that had changes upon their column names(JSON properties) as newer the datasets became on the NoSQL database. This article will again focus on the Data Cleansing of the data.
Cleansing 2:Standardizing the datetime format
Sampling values of the data on the database, notice there is a inconsistency with the item Last_Update
. This item is a datetime data type, and we could see there were 3 ways the data was expressed in the data.
2020-03-27 22:14:55
2020-02-05T12:13:04
3/22/20 23:45
The first and the Second type is rather similar, but the "T" between the date and the time makes the two different.
This Last_Update
column gives how up-to-date the data [numbers of infected, death, and recovered] data from COVID19 is. Inconsistent date forms are a problem, so for this circumstance we have Standardized the date format to YYYY-MM-DD. We can change the Last_Update
value, but for this occasion, we added a JSON property Period
.
For example, below is the DB data before cleansing.
As below, we will process the data to add the column Period
Like the last article, processing the job below will result in the standardized datetime data.
Cleansing 3:Standardization of Country, and Region
Next, lets look at data inside the database for the Country_Region
column which defines the countries which hold the [infected, death, and recovered] data.
Looking at the [Country/Region] column value, notice some problems below.
Inconsistencies in the [Country/Region] format
First, examples of having the same country/region but different formats of writing.
Russia
, and Russian Federation
United Kingdom
, and UK
Cabo Verde
, and Cape Verde
Viet Nam
, and Vietnam
Guernsey
, and Channel Islands
also, there are Different formats for Special Administrative Region for China.
Hong Kong
, and Hong Kong SAR
Macao
, and Macao SAR
There were inconsistencies between the formats for having the word "Republic" in or not.
Ireland
, and Republic of Ireland
Czechia
, and Czech Republic
Moldova
, and Republic of Moldova
Iran
, and Iran (Islamic Republic of)
Korea, South
, Republic of Korea
, and South Korea
Inconsistencies between the formats for having the word "The" in or not, is also noticed.
Bahamas
, Bahamas, The
, and The Bahamas
Gambia
, Gambia, The
, and The Gambia
One of the major characteristics of this dataset is that there is cruise ship names shown in the country/region section.
Cruise Ship
Diamond Princess
MS Zaandam
There are columns that showed the "other" for unknown purposes as well.
Setting the Country/Region code schemes
Since the Johns Hopkins University Datasets have inconsistencies within their data formats, it is important to consider organizing the code scheme of the data, especially thinking about when combining the information with other datasets.
For the Country/Region code schemes, we chose the ISO3166-1 format. And for the process, we have decided to add a new CountryCode
column rather than changing the Country_Region
.
Notice there is the new value in the CountryCode
column below.
The data above is acquired by using the process below within the NoSQL database.
In the real code, lines 30~34 which represents the replacement words of each Country/Region is much longer. This way, Russia
and Russian Federation
will be replaced by RU
. Columns that are not stated in the ISO format, will be stated in the CountryCode
as the value of the Country_Region
, such as the Diamond Princess
.
Summary: Adding Newer columns as we go
Like the last article, we have focused on the schema-less database approach of “Storing, then cleansing data in steps” explaining the flexible nature of the database.
Especially this time, there are new columns added as new datasets come in. It is like adding columns to a table when using RDB as an example. Doing this without the system stopping, is a advantage of a schema-less, NoSQL database.
standardizing the“country/region” item by applying the existing ISO code scheme. This also made combining other datasets with ease since most datasets are using the ISO code scheme.
The next article will focus on how to create a line graph of the number of infected people(COVID-19) per country using the datasets we have been preparing in the last 3 articles.