(4) Standardization and cleansing of code schemes
The last article was about the cleansing stage. Standardizing Johns Hopkins University Datasets that had changes upon their column names(JSON properties) as newer the datasets became on the NoSQL database. This article will again focus on the Data Cleansing of the data.
Cleansing 2：Standardizing the datetime format
Sampling values of the data on the database, notice there is a inconsistency with the item
Last_Update. This item is a datetime data type, and we could see there were 3 ways the data was expressed in the data.
The first and the Second type is rather similar, but the "T" between the date and the time makes the two different.
Last_Update column gives how up-to-date the data [numbers of infected, death, and recovered] data from COVID19 is. Inconsistent date forms are a problem, so for this circumstance we have Standardized the date format to YYYY-MM-DD. We can change the
Last_Update value, but for this occasion, we added a JSON property
For example, below is the DB data before cleansing.
As below, we will process the data to add the column
Like the last article, processing the job below will result in the standardized datetime data.
Cleansing 3：Standardization of Country, and Region
Next, lets look at data inside the database for the
Country_Region column which defines the countries which hold the [infected, death, and recovered] data.
Looking at the [Country/Region] column value, notice some problems below.
Inconsistencies in the [Country/Region] format
First, examples of having the same country/region but different formats of writing.
United Kingdom, and
Cabo Verde, and
Viet Nam, and
also, there are Different formats for Special Administrative Region for China.
Hong Kong, and
Hong Kong SAR
There were inconsistencies between the formats for having the word "Republic" in or not.
Republic of Ireland
Republic of Moldova
Iran (Islamic Republic of)
Republic of Korea, and
Inconsistencies between the formats for having the word "The" in or not, is also noticed.
Bahamas, The, and
Gambia, The, and
One of the major characteristics of this dataset is that there is cruise ship names shown in the country/region section.
There are columns that showed the "other" for unknown purposes as well.
Setting the Country/Region code schemes
Since the Johns Hopkins University Datasets have inconsistencies within their data formats, it is important to consider organizing the code scheme of the data, especially thinking about when combining the information with other datasets.
For the Country/Region code schemes, we chose the ISO3166-1 format. And for the process, we have decided to add a new
CountryCode column rather than changing the
Notice there is the new value in the
CountryCode column below.
The data above is acquired by using the process below within the NoSQL database.
In the real code, lines 30~34 which represents the replacement words of each Country/Region is much longer. This way,
Russian Federation will be replaced by
RU. Columns that are not stated in the ISO format, will be stated in the
CountryCode as the value of the
Country_Region, such as the
Summary: Adding Newer columns as we go
Like the last article, we have focused on the schema-less database approach of “Storing, then cleansing data in steps” explaining the flexible nature of the database.
Especially this time, there are new columns added as new datasets come in. It is like adding columns to a table when using RDB as an example. Doing this without the system stopping, is a advantage of a schema-less, NoSQL database.
standardizing the“country/region” item by applying the existing ISO code scheme. This also made combining other datasets with ease since most datasets are using the ISO code scheme.
The next article will focus on how to create a line graph of the number of infected people(COVID-19) per country using the datasets we have been preparing in the last 3 articles.