Data Exploration

During this exercise, you will learn how to use the Metadata Explorer to explore your data sources. By profiling, previewing, and viewing the metadata of a dataset, you can identify data that needs cleansing and find commonalities across datasets. You will also publish a dataset from Amazon S3 to your Metadata Catalog. Metadata is information about your data, such as column names, column data types, object types, and connection information.

  1. Launch the Metadata Explorer.

  2. Click on Browse Connections.

  3. Click the S3 Connection CLOUD_STORAGE.

  4. Navigate to the folder /DI_ML/TA/exercises_{your group}/. You see three files: customers.csv, devices.csv, events.parquet.

    • Your group number is the last 2 digits of your username.

  5. Select the devices.csv file and press the Start Profile option.

  6. Select Yes in the dialog.

  7. Next, for the same file, select New Publication Action. This adds the file's metadata to the Metadata Catalog.

  8. On the Publications tab, enter the publication name and a description:

    • NAME: devices_##Your username##, e.g. devices_ACXXXXUXX
    • Click the Publish button.

  9. By now, profiling should have completed, and a notification appears in the top right corner of the screen. If it has not appeared yet, please wait.

  10. Select the Option menu and click View Fact Sheet.

  11. The fact sheet opens with a detailed view of the dataset.

  12. Select the Columns tab. This overview shows the metadata of the dataset and the values of its fields.

  13. Select the COUNTRY column.

  14. As displayed, this attribute has a large number of empty values. The dataset needs to be enriched.

  15. Let’s have a look at the Data Preview. Indeed, null/empty values exist in the COUNTRY field.
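
To make the empty-value finding concrete, here is a minimal Python sketch of the kind of check a column profile performs. It is not how the Metadata Explorer computes its profile. The sample rows, the DEVICE_ID column, and the helper name profile_empty are assumptions made for illustration; only the COUNTRY column name comes from the exercise.

```python
import csv
import io

# Made-up sample rows for illustration only; the real devices.csv
# has its own columns and values.
SAMPLE_CSV = """DEVICE_ID,COUNTRY
d1,DE
d2,
d3,US
d4,
"""


def profile_empty(rows, column):
    """Return (empty_count, total_count) for one column.

    A value counts as empty when it is missing or contains only whitespace.
    """
    total = 0
    empty = 0
    for row in rows:
        total += 1
        value = row.get(column)
        if value is None or value.strip() == "":
            empty += 1
    return empty, total


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
empty, total = profile_empty(reader, "COUNTRY")
print(f"COUNTRY: {empty} of {total} values empty")
```

To run the same check against a downloaded copy of the file, replace the io.StringIO source with an open file handle.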

  1. Return to the home page of the Metadata Explorer.

  2. In the home page search for country.

    • Note: The publication from the earlier step must have finished for this to work.

  3. All datasets that contain Country as a column name or an annotation, or that have Country in their name, appear in the results.
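
The matching described in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog is a hypothetical in-memory list, the dataset names and columns are invented, and case-insensitive substring matching is an assumption about how the search behaves, not a documented rule of the Metadata Explorer.

```python
# Hypothetical in-memory stand-in for a metadata catalog.
# All dataset names, columns, and annotations below are invented.
CATALOG = [
    {"name": "country_codes.csv", "columns": ["CODE", "LABEL"], "annotations": []},
    {"name": "devices_demo.csv", "columns": ["DEVICE_ID", "COUNTRY"], "annotations": []},
    {"name": "orders_demo.csv", "columns": ["ORDER_ID", "REGION"], "annotations": ["Country"]},
    {"name": "events_demo.parquet", "columns": ["EVENT_ID"], "annotations": []},
]


def search(catalog, term):
    """Return names of datasets whose name, columns, or annotations contain term.

    Matching is case-insensitive substring matching (an assumption).
    """
    needle = term.lower()
    hits = []
    for dataset in catalog:
        fields = [dataset["name"], *dataset["columns"], *dataset["annotations"]]
        if any(needle in field.lower() for field in fields):
            hits.append(dataset["name"])
    return hits


results = search(CATALOG, "country")
print(results)
```

In this sketch the first three datasets match, by name, by column name, and by annotation respectively, while the fourth does not.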


Well done! You have explored the devices.csv file, gained insight into the datasets available to you, and searched the Metadata Catalog.