Tabular data#

Tabular data is one of the most common formats for storing and organizing information. It is structured in a way that makes it easy to analyze and manipulate, consisting of rows and columns similar to a spreadsheet or database table. Each row in the table typically represents an individual record or observation, while each column represents a specific attribute or variable of those records.

Simple example#

ID   Name            Age  Subscription  Is Premium
101  Alice Johnson   28   2023-05-14    Yes
102  Bob Smith       34   2022-11-22    No
103  Carol Martinez  23   2024-01-08    Yes
104  David Lee       40   2021-07-19    No
105  Eva Chen        30   2023-03-30    Yes

Note that

  • the columns ID and Age have a numeric (integer) type

  • the column Name contains string values

  • the column Subscription stores dates

  • the last column Is Premium is binary (it takes only two possible values: Yes or No)
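
As a quick illustration, here is a minimal sketch of how this example table could be built as a pandas dataframe (the values are copied from the table above; pandas itself is introduced later in this section):

import pandas as pd

# build the example table; pandas infers an appropriate dtype for each column
df = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "Name": ["Alice Johnson", "Bob Smith", "Carol Martinez", "David Lee", "Eva Chen"],
    "Age": [28, 34, 23, 40, 30],
    "Subscription": pd.to_datetime(
        ["2023-05-14", "2022-11-22", "2024-01-08", "2021-07-19", "2023-03-30"]
    ),
    "Is Premium": ["Yes", "No", "Yes", "No", "Yes"],
})

# int64, object, int64, datetime64[ns], object
df.dtypes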

Types of tabular data#

Tabular data can be categorized into different types depending on the nature of the information stored.

Numerical data#

Columns with numerical data represent quantitative attributes that can be measured and expressed as numbers, e.g.:

  • age

  • income

  • temperature

  • sales figures

Numerical data can further be divided into continuous and discrete.

Continuous variables#

Continuous variables can take any value within a given range, including fractions or decimals. They typically arise from measurements that can, in principle, be made arbitrarily precise. Examples:

  • height/weight of a person

  • execution time of a program

  • distance traveled

  • speed of a vehicle

  • pressure in a tire

  • amount of rainfall

  • stock prices

  • exchange rates

  • blood glucose level

Discrete variables#

Discrete variables can only take specific, distinct values, often whole numbers. These values are countable, and there are no intermediate values between them. Discrete variables are usually related to counting:

  • number of employees in a company

  • number of products sold

  • pages in a book

  • goals scored in a soccer match

  • number of visits to a website

  • row index of a matrix/table

  • rating on a scale (e.g., from 1 to 5 stars)

Categorical data#

This type of data represents qualitative attributes whose values fall into distinct categories. Categorical variables are divided into nominal and ordinal.

Categorical nominal variables#

Nominal data consists of categories with no inherent order or ranking. For example:

  • colors

  • fruits

  • gender

  • countries

Categorical ordinal variables#

Ordinal data includes categories with a meaningful order or ranking. Examples:

  • education level

  • customer satisfaction

  • movie rating

  • top-10 items suggested by a search engine

Tip

In machine learning the range of categorical variables with \(K\) different values is usually denoted as \(\{0, 1, \ldots, K-1\}\) or \(\{1, \ldots, K\}\).
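
As an illustration, here is a small sketch of mapping an ordinal variable to such integer codes with pandas (the education levels and their order are made up for this example):

import pandas as pd

# an ordinal variable: education level with a meaningful order of categories
levels = ["primary", "secondary", "bachelor", "master", "phd"]
education = pd.Categorical(
    ["bachelor", "primary", "phd", "secondary", "bachelor"],
    categories=levels,
    ordered=True,
)

# integer codes 0, ..., K-1 respect the chosen order of categories
education.codes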

Text#

Text data includes any non-numeric, string-based information such as names, addresses, or descriptions.

Date/Time#

This type of data captures temporal information, such as dates and times. There are plenty of formats for representing such data:

  • date only: 2023-08-26, 26/08/2023, 26-Aug-2023

  • time only: 13:35:45, 01:35 PM, 13:35:45.123

  • date and time: 2023/08/26 13:35:45, 26-Aug-2023 01:35 PM, 2023-08-26T13:35:45.123Z

  • Unix timestamp: 1693056945
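
Several of these formats can be parsed with pandas; a small sketch (pd.to_datetime is discussed in more detail later in this section):

import pandas as pd

# parse different date/time representations into pandas timestamps
pd.to_datetime("2023-08-26")                                        # date only
pd.to_datetime("26-Aug-2023 01:35 PM", format="%d-%b-%Y %I:%M %p")  # date and time
pd.to_datetime("2023-08-26T13:35:45.123Z")                          # ISO 8601, UTC
pd.to_datetime(1693056945, unit="s")                                # Unix timestamp in seconds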

CSV in pandas#

A CSV file (Comma-Separated Values) is a plain text file where each line represents a row and columns are separated by commas. This simple format is widely used for storing tabular data. One of the easiest ways to open a CSV file in Python is the pandas library.

Tip

To install the Python library pandas, run the command pip install pandas

read_csv()#

The pd.read_csv() function reads a CSV file and returns its content as a pandas dataframe. Let’s read the Auto dataset:

import pandas as pd

# read Auto dataset
df_auto = pd.read_csv("../datasets/ISLP/Auto.csv")

# sample 7 random rows of the dataframe
df_auto.sample(7)
      mpg  cylinders  displacement horsepower  weight  acceleration  year  origin                       name
59   23.0          4          97.0         54    2254          23.5    72       2          volkswagen type 3
203  29.5          4          97.0         71    1825          12.2    76       2          volkswagen rabbit
191  22.0          6         225.0        100    3233          15.4    76       1           plymouth valiant
220  33.5          4          85.0         70    1945          16.8    77       3      datsun f-10 hatchback
326  43.4          4          90.0         48    2335          23.7    80       2         vw dasher (diesel)
324  40.8          4          85.0         65    2110          19.2    80       3                 datsun 210
369  34.0          4         112.0         88    2395          18.0    82       1  chevrolet cavalier 2-door

The shape attribute stores the size of the table as a tuple (number of rows, number of columns):

df_auto.shape
(397, 9)

info() and describe()#

The info() method prints some information about the columns of a dataset:

  • column names

  • number of present (non-missing) values in each column

  • underlying type of each column

df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    object 
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB

The describe() method outputs statistical information about all numeric columns of a dataset:

df_auto.describe()
              mpg   cylinders  displacement       weight  acceleration        year      origin
count  397.000000  397.000000    397.000000   397.000000    397.000000  397.000000  397.000000
mean    23.515869    5.458438    193.532746  2970.261965     15.555668   75.994962    1.574307
std      7.825804    1.701577    104.379583   847.904119      2.749995    3.690005    0.802549
min      9.000000    3.000000     68.000000  1613.000000      8.000000   70.000000    1.000000
25%     17.500000    4.000000    104.000000  2223.000000     13.800000   73.000000    1.000000
50%     23.000000    4.000000    146.000000  2800.000000     15.500000   76.000000    1.000000
75%     29.000000    8.000000    262.000000  3609.000000     17.100000   79.000000    2.000000
max     46.600000    8.000000    455.000000  5140.000000     24.800000   82.000000    3.000000
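
By default describe() summarizes only numeric columns. A quick sketch of how to include the remaining columns as well (include='all' is a standard parameter of describe()):

# summarize all columns; object columns get count/unique/top/freq statistics
df_auto.describe(include='all')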

pd.to_datetime()#

The to_datetime() function converts a column to date/time type. Here we convert the numeric column year to datetime type:

df_auto['year'] = pd.to_datetime(df_auto['year'], format='%y')

# shows last 5 rows
df_auto.tail()
      mpg  cylinders  displacement horsepower  weight  acceleration       year  origin             name
392  27.0          4         140.0         86    2790          15.6 1982-01-01       1  ford mustang gl
393  44.0          4          97.0         52    2130          24.6 1982-01-01       2        vw pickup
394  32.0          4         135.0         84    2295          11.6 1982-01-01       1    dodge rampage
395  28.0          4         120.0         79    2625          18.6 1982-01-01       1      ford ranger
396  31.0          4         119.0         82    2720          19.4 1982-01-01       1       chevy s-10

Now the column year has the special type datetime64:

df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   mpg           397 non-null    float64       
 1   cylinders     397 non-null    int64         
 2   displacement  397 non-null    float64       
 3   horsepower    397 non-null    object        
 4   weight        397 non-null    int64         
 5   acceleration  397 non-null    float64       
 6   year          397 non-null    datetime64[ns]
 7   origin        397 non-null    int64         
 8   name          397 non-null    object        
dtypes: datetime64[ns](1), float64(3), int64(3), object(2)
memory usage: 28.0+ KB
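
Once a column has datetime type, its components can be accessed through the .dt accessor; a small sketch:

# extract the year back from the datetime column as an integer
df_auto['year'].dt.year.tail()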

Caution

Applying pd.to_datetime() to the Auto dataset does not make much sense and is done here only for educational purposes.

One-hot encoding#

Before feeding categorical data into machine learning models, we usually need to convert it to a numerical scale. The standard way to do this is one-hot encoding (OHE), also called dummy encoding.

If a feature belongs to the finite set \( \{1, \ldots, K\}\), it is encoded by a binary vector

\[ (\delta_1, \ldots, \delta_K) \in \{0, 1\}^K, \quad \sum\limits_{k=1}^K \delta _k = 1. \]

Thus, each categorical variable, which takes \(K\) different values, is converted to \(K\) numeric variables.

For example, consider the feature weathersit in the Bikeshare dataset:

df_bike = pd.read_csv("../datasets/ISLP/Bikeshare.csv").drop("Unnamed: 0", axis=1)

# count all unique values in the column `weathersit`
df_bike['weathersit'].value_counts()
weathersit
clear              5645
cloudy/misty       2218
light rain/snow     781
heavy rain/snow       1
Name: count, dtype: int64

Here \(K = 4\), so after OHE this feature will be replaced by a \(4\)-dimensional dummy vector, e.g.:

  • clear = \((0, 0, 0, 1)\)

  • cloudy/misty = \((0, 0, 1, 0)\)

  • light rain/snow = \((0, 1, 0, 0)\)

  • heavy rain/snow = \((1, 0, 0, 0)\)

OHE can be performed by sklearn.preprocessing.OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])
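
Note that OneHotEncoder determines the order of the dummy columns itself (categories are sorted, so here clear corresponds to the first position), which is why it may differ from the illustrative mapping above. The learned order can be inspected as follows:

# the i-th position of the dummy vector corresponds to ohe.categories_[0][i]
ohe.categories_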

Note

In fact, it is enough to have \(K-1\) dummy variables since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\). This is also called one-hot encoding with dropping.

By default, OneHotEncoder retains all values of categorical features. To enable dropping, pass drop='first':

ohe = OneHotEncoder(sparse_output=False, drop='first')
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
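
Pandas provides similar functionality out of the box; a short sketch using pd.get_dummies (its drop_first parameter plays the same role as drop='first' above):

# one-hot encode the column directly in pandas;
# drop_first=True keeps K-1 dummy columns instead of K
pd.get_dummies(df_bike['weathersit'], drop_first=True).head()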

Exercises#

  1. Do we have any categorical variables in this table?

  2. What are the advantages and disadvantages of such an encoding of a categorical variable? Does the answer depend on whether it is nominal or ordinal?

  3. The column horsepower of the Auto dataset contains numbers; however, it is missing from the output of the describe() method. Why?

  4. The Bikeshare dataset has several columns related to date/time:

# print first 5 rows
df_bike.head()
   season mnth  day  hr  holiday  weekday  workingday weathersit  temp   atemp   hum  windspeed  casual  registered  bikers
0       1  Jan    1   0        0        6           0      clear  0.24  0.2879  0.81        0.0       3          13      16
1       1  Jan    1   1        0        6           0      clear  0.22  0.2727  0.80        0.0       8          32      40
2       1  Jan    1   2        0        6           0      clear  0.22  0.2727  0.80        0.0       5          27      32
3       1  Jan    1   3        0        6           0      clear  0.24  0.2879  0.75        0.0       3          10      13
4       1  Jan    1   4        0        6           0      clear  0.24  0.2879  0.75        0.0       0           1       1

Create a new column datetime which will store information about both date and time.

  5. What can be said about the linear dependence of the columns produced by one-hot encoding? Consider two cases: with and without dropping.