Tabular data#

Tabular data is one of the most common formats for storing and organizing information. It is structured in a way that makes it easy to analyze and manipulate, consisting of rows and columns similar to a spreadsheet or database table. Each row in the table typically represents an individual record or observation, while each column represents a specific attribute or variable of those records.

Simple example#

ID   Name            Age  Subscription  Is Premium
101  Alice Johnson   28   2023-05-14    Yes
102  Bob Smith       34   2022-11-22    No
103  Carol Martinez  23   2024-01-08    Yes
104  David Lee       40   2021-07-19    No
105  Eva Chen        30   2023-03-30    Yes

Note that

  • the columns ID and Age have a numeric (integer) type

  • the column Name contains string values

  • the column Subscription stores dates

  • the last column Is Premium is binary (it takes only two possible values: Yes or No)
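
As a quick illustration, here is a minimal sketch of how this example table could be built as a pandas dataframe (the values are copied from the table above; pandas itself is introduced later in this section):

import pandas as pd

# build the example table; pandas infers an appropriate dtype for each column
df = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "Name": ["Alice Johnson", "Bob Smith", "Carol Martinez", "David Lee", "Eva Chen"],
    "Age": [28, 34, 23, 40, 30],
    "Subscription": pd.to_datetime(
        ["2023-05-14", "2022-11-22", "2024-01-08", "2021-07-19", "2023-03-30"]
    ),
    "Is Premium": ["Yes", "No", "Yes", "No", "Yes"],
})

# int64, object, int64, datetime64[ns], object
df.dtypes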

Types of tabular data#

Tabular data can be categorized into different types depending on the nature of the information stored.

Numerical data#

Columns with numerical data represent quantitative attributes that can be measured and expressed as numbers, e.g.:

  • age

  • income

  • temperature

  • sales figures

Numerical data can further be divided into continuous and discrete.

Continuous variables#

Continuous variables can take any value within a given range, including fractions or decimals. They typically arise from measurements that can, in principle, be made arbitrarily precise. Examples:

  • height/weight of a person

  • execution time of a program

  • distance traveled

  • speed of a vehicle

  • pressure in a tire

  • amount of rainfall

  • stock prices

  • exchange rates

  • blood glucose level

Discrete variables#

Discrete variables can only take specific, distinct values, often whole numbers. These values are countable, and there are no intermediate values between them. Discrete variables are usually related to counting:

  • number of employees in a company

  • number of products sold

  • pages in a book

  • goals scored in a soccer match

  • number of visits to a website

  • row index of a matrix/table

  • rating on a scale (e.g., from 1 to 5 stars)

Categorical data#

This type of data represents qualitative attributes whose values fall into distinct categories. Categorical variables are divided into nominal and ordinal.

Categorical nominal variables#

Nominal data consists of categories with no inherent order or ranking. For example:

  • colors

  • fruits

  • gender

  • countries

Categorical ordinal variables#

Ordinal data includes categories with a meaningful order or ranking. Examples:

  • education level

  • customer satisfaction

  • movie rating

  • top-10 items suggested by a search engine

Tip

In machine learning the range of categorical variables with \(K\) different values is usually denoted as \(\{0, 1, \ldots, K-1\}\) or \(\{1, \ldots, K\}\).
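
As an illustration, here is a small sketch of mapping an ordinal variable to such integer codes with pandas (the education levels and their order are made up for this example):

import pandas as pd

# an ordinal variable: education level with a meaningful order of categories
levels = ["primary", "secondary", "bachelor", "master", "phd"]
education = pd.Categorical(
    ["bachelor", "primary", "phd", "secondary", "bachelor"],
    categories=levels,
    ordered=True,
)

# integer codes 0, ..., K-1 respect the chosen order of categories
education.codes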

Text#

Text data includes any non-numeric, string-based information such as names, addresses, or descriptions.

Date/Time#

This type of data captures temporal information, such as dates and times. There are plenty of formats for representing such data:

  • date only: 2023-08-26, 26/08/2023, 26-Aug-2023

  • time only: 13:35:45, 01:35 PM, 13:35:45.123

  • date and time: 2023/08/26 13:35:45, 26-Aug-2023 01:35 PM, 2023-08-26T13:35:45.123Z

  • Unix timestamp: 1693056945
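
Several of these formats can be parsed with pandas; a small sketch (pd.to_datetime is discussed in more detail later in this section):

import pandas as pd

# parse different date/time representations into pandas timestamps
pd.to_datetime("2023-08-26")                                        # date only
pd.to_datetime("26-Aug-2023 01:35 PM", format="%d-%b-%Y %I:%M %p")  # date and time
pd.to_datetime("2023-08-26T13:35:45.123Z")                          # ISO 8601, UTC
pd.to_datetime(1693056945, unit="s")                                # Unix timestamp in seconds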

CSV in pandas#

A CSV file (Comma-Separated Values) is a plain text file where each line represents a row and columns are separated by commas. This simple format is widely used for storing tabular data. One of the easiest ways to open a CSV file in Python is the pandas library.

Tip

To install the Python library pandas, run the command pip install pandas

read_csv()#

The pd.read_csv() function reads a CSV file and returns its content as a pandas dataframe. Let’s read the Auto dataset:

import pandas as pd

# read Auto dataset
df_auto = pd.read_csv("../datasets/ISLP/Auto.csv")

# sample 7 random rows of the dataframe
df_auto.sample(7)
      mpg  cylinders  displacement horsepower  weight  acceleration  year  origin                       name
59   23.0          4          97.0         54    2254          23.5    72       2          volkswagen type 3
203  29.5          4          97.0         71    1825          12.2    76       2          volkswagen rabbit
191  22.0          6         225.0        100    3233          15.4    76       1           plymouth valiant
220  33.5          4          85.0         70    1945          16.8    77       3      datsun f-10 hatchback
326  43.4          4          90.0         48    2335          23.7    80       2         vw dasher (diesel)
324  40.8          4          85.0         65    2110          19.2    80       3                 datsun 210
369  34.0          4         112.0         88    2395          18.0    82       1  chevrolet cavalier 2-door

The shape attribute stores the size of the table as a tuple (number of rows, number of columns):

df_auto.shape
(397, 9)

info() and describe()#

The info() method prints some information about the columns of a dataset:

  • column names

  • number of present (non-missing) values in each column

  • underlying type of each column

df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    object 
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB

The describe() method outputs statistical information about all numeric columns of a dataset:

df_auto.describe()
              mpg   cylinders  displacement       weight  acceleration        year      origin
count  397.000000  397.000000    397.000000   397.000000    397.000000  397.000000  397.000000
mean    23.515869    5.458438    193.532746  2970.261965     15.555668   75.994962    1.574307
std      7.825804    1.701577    104.379583   847.904119      2.749995    3.690005    0.802549
min      9.000000    3.000000     68.000000  1613.000000      8.000000   70.000000    1.000000
25%     17.500000    4.000000    104.000000  2223.000000     13.800000   73.000000    1.000000
50%     23.000000    4.000000    146.000000  2800.000000     15.500000   76.000000    1.000000
75%     29.000000    8.000000    262.000000  3609.000000     17.100000   79.000000    2.000000
max     46.600000    8.000000    455.000000  5140.000000     24.800000   82.000000    3.000000
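
By default describe() summarizes only numeric columns. A quick sketch of how to include the remaining columns as well (include='all' is a standard parameter of describe()):

# summarize all columns; object columns get count/unique/top/freq statistics
df_auto.describe(include='all')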

pd.to_datetime()#

The to_datetime() function converts a column to date/time type. Here we convert the numeric column year to datetime type:

df_auto['year'] = pd.to_datetime(df_auto['year'], format='%y')

# shows last 5 rows
df_auto.tail()
      mpg  cylinders  displacement horsepower  weight  acceleration       year  origin             name
392  27.0          4         140.0         86    2790          15.6 1982-01-01       1  ford mustang gl
393  44.0          4          97.0         52    2130          24.6 1982-01-01       2        vw pickup
394  32.0          4         135.0         84    2295          11.6 1982-01-01       1    dodge rampage
395  28.0          4         120.0         79    2625          18.6 1982-01-01       1      ford ranger
396  31.0          4         119.0         82    2720          19.4 1982-01-01       1       chevy s-10

Now the column year has the special type datetime64:

df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   mpg           397 non-null    float64       
 1   cylinders     397 non-null    int64         
 2   displacement  397 non-null    float64       
 3   horsepower    397 non-null    object        
 4   weight        397 non-null    int64         
 5   acceleration  397 non-null    float64       
 6   year          397 non-null    datetime64[ns]
 7   origin        397 non-null    int64         
 8   name          397 non-null    object        
dtypes: datetime64[ns](1), float64(3), int64(3), object(2)
memory usage: 28.0+ KB
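
Once a column has datetime type, its components can be accessed through the .dt accessor; a small sketch:

# extract the year back from the datetime column as an integer
df_auto['year'].dt.year.tail()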

Caution

Applying pd.to_datetime() to the Auto dataset does not make much sense and is done here only for educational purposes.

One-hot encoding#

Before feeding categorical data into machine learning models, we usually need to convert it to a numerical scale. The standard way to do this is one-hot encoding (OHE), also called dummy encoding.

If a feature belongs to the finite set \( \{1, \ldots, K\}\), it is encoded by a binary vector

\[ (\delta_1, \ldots, \delta_K) \in \{0, 1\}^K, \quad \sum\limits_{k=1}^K \delta _k = 1. \]

Thus, each categorical variable, which takes \(K\) different values, is converted to \(K\) numeric variables.

For example, consider the feature weathersit in the Bikeshare dataset:

df_bike = pd.read_csv("../datasets/ISLP/Bikeshare.csv").drop("Unnamed: 0", axis=1)

# count all unique values in the column `weathersit`
df_bike['weathersit'].value_counts()
weathersit
clear              5645
cloudy/misty       2218
light rain/snow     781
heavy rain/snow       1
Name: count, dtype: int64

Here \(K = 4\), so after OHE this feature will be replaced by a \(4\)-dimensional dummy vector, e.g.:

  • clear = \((0, 0, 0, 1)\)

  • cloudy/misty = \((0, 0, 1, 0)\)

  • light rain/snow = \((0, 1, 0, 0)\)

  • heavy rain/snow = \((1, 0, 0, 0)\)

OHE can be performed by sklearn.preprocessing.OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]])
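
Note that OneHotEncoder determines the order of the dummy columns itself (categories are sorted, so here clear corresponds to the first position), which is why it may differ from the illustrative mapping above. The learned order can be inspected as follows:

# the i-th position of the dummy vector corresponds to ohe.categories_[0][i]
ohe.categories_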

Note

In fact, it is enough to have \(K-1\) dummy variables since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\). This is also called one-hot encoding with dropping.

By default, OneHotEncoder retains all values of categorical features. To enable dropping, pass drop='first':

ohe = OneHotEncoder(sparse_output=False, drop='first')
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
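
Pandas provides similar functionality out of the box; a short sketch using pd.get_dummies (its drop_first parameter plays the same role as drop='first' above):

# one-hot encode the column directly in pandas;
# drop_first=True keeps K-1 dummy columns instead of K
pd.get_dummies(df_bike['weathersit'], drop_first=True).head()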

Exercises#

  1. Do we have any categorical variables in this table?

  2. What are the advantages and disadvantages of such an encoding of a categorical variable? Does the answer depend on whether it is nominal or ordinal?

  3. The column horsepower of the Auto dataset contains numbers; however, it is missing from the output of the describe() method. Why?

  4. The Bikeshare dataset has several columns related to date/time:

# print first 5 rows
df_bike.head()
   season mnth  day  hr  holiday  weekday  workingday weathersit  temp   atemp   hum  windspeed  casual  registered  bikers
0       1  Jan    1   0        0        6           0      clear  0.24  0.2879  0.81        0.0       3          13      16
1       1  Jan    1   1        0        6           0      clear  0.22  0.2727  0.80        0.0       8          32      40
2       1  Jan    1   2        0        6           0      clear  0.22  0.2727  0.80        0.0       5          27      32
3       1  Jan    1   3        0        6           0      clear  0.24  0.2879  0.75        0.0       3          10      13
4       1  Jan    1   4        0        6           0      clear  0.24  0.2879  0.75        0.0       0           1       1

Create a new column datetime which will store information about both date and time.

  5. What can be said about the linear dependence of the columns produced by one-hot encoding? Consider two cases: with and without dropping.