Tabular data#
Tabular data is one of the most common formats for storing and organizing information. It is structured in a way that makes it easy to analyze and manipulate, consisting of rows and columns similar to a spreadsheet or database table. Each row in the table typically represents an individual record or observation, while each column represents a specific attribute or variable of those records.
Simple example#
| ID | Name | Age | Subscription | Is Premium |
|---|---|---|---|---|
| 101 | Alice Johnson | 28 | 2023-05-14 | Yes |
| 102 | Bob Smith | 34 | 2022-11-22 | No |
| 103 | Carol Martinez | 23 | 2024-01-08 | Yes |
| 104 | David Lee | 40 | 2021-07-19 | No |
| 105 | Eva Chen | 30 | 2023-03-30 | Yes |
Note that

- columns `ID` and `Age` have numeric (integer) type
- column `Name` contains string values
- column `Subscription` stores dates
- the last column `Is Premium` is binary (only two possible values in this column: `Yes` or `No`)
Types of tabular data#
Tabular data can be categorized into different types depending on the nature of the information stored.
Numerical data#
Columns with numerical data represent quantitative attributes that can be measured and expressed as numbers, e.g.:

- age
- income
- temperature
- sales figures
Numerical data can further be divided into continuous and discrete.
Continuous variables#
Continuous variables can take any value within a given range, including fractions or decimals. They are typically associated with measurements that can be infinitely precise, depending on the level of measurement detail. Examples:
height/weight of a person
execution time of a program
distance traveled
speed of a vehicle
pressure in a tire
amount of rainfall
stock prices
exchange rates
blood glucose level
Discrete variables#
Discrete variables can only take specific, distinct values, often whole numbers. These values are countable, and there are no intermediate values between them. Discrete variables are usually related to counting:
number of employees in a company
number of products sold
pages in a book
goals scored in a soccer match
number of visits to a website
row index of a matrix/table
rating on a scale (e.g., from 1 to 5 stars)
In practice everything is discrete
Do not overestimate this theoretical distinction between continuous and discrete numeric data. In practical terms, many continuous variables are measured with a finite level of precision, making them effectively discrete. For example, currency is typically continuous in concept (you can have any amount of money), but when using a standard currency system like dollars and cents, it becomes discrete (e.g., $7.32 = 732 cents).
Categorical data#
This type of data represents qualitative attributes and is often divided into distinct categories. Categorical variables are divided into nominal and ordinal.
Categorical nominal variables#
Nominal data consists of categories with no inherent order or ranking. For example:
colors
fruits
gender
countries
Categorical ordinal variables#
Ordinal data includes categories with a meaningful order or ranking. Examples:
education level
customer satisfaction
movie rating
top-10 items suggested by a search engine
Tip
In machine learning the range of categorical variables with \(K\) different values is usually denoted as \(\{0, 1, \ldots, K-1\}\) or \(\{1, \ldots, K\}\).
Text#
Text data includes any non-numeric, string-based information such as names, addresses, or descriptions.
Date/Time#
This type of data captures temporal information, such as dates and times. There are plenty of formats for representing such data:

- date only: `2023-08-26`, `26/08/2023`, `26-Aug-2023`
- time only: `13:35:45`, `01:35 PM`, `13:35:45.123`
- date and time: `2023/08/26 13:35:45`, `26-Aug-2023 01:35 PM`, `2023-08-26T13:35:45.123Z`
- Unix timestamp: `1693056945`
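As a small sketch, several of the formats above can be parsed with pandas; the format strings below are assumptions chosen to match these examples:

```python
import pandas as pd

# parse some of the formats listed above
d = pd.to_datetime("2023-08-26")                                         # date only
dt = pd.to_datetime("26-Aug-2023 01:35 PM", format="%d-%b-%Y %I:%M %p")  # date and time
ts = pd.to_datetime(1693056945, unit="s")                                # Unix timestamp
print(d, dt, ts)
```

Note that the Unix timestamp `1693056945` decodes to the same moment as `2023-08-26T13:35:45Z` above.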
CSV in pandas#
A CSV file (Comma-Separated Values) is a plain text file where each line represents a row and columns are separated by commas. This simple format is widely used for storing tabular data. One of the easiest ways to open a CSV file in Python is pandas.
Tip
To install the Python library `pandas`, run the command `pip install pandas`.
read_csv()#
The `pd.read_csv()` function reads a CSV file and returns its content as a pandas dataframe. Let’s read the Auto dataset:
import pandas as pd
# read Auto dataset
df_auto = pd.read_csv("../datasets/ISLP/Auto.csv")
# sample 7 random rows of the dataframe
df_auto.sample(7)
|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 59 | 23.0 | 4 | 97.0 | 54 | 2254 | 23.5 | 72 | 2 | volkswagen type 3 |
| 203 | 29.5 | 4 | 97.0 | 71 | 1825 | 12.2 | 76 | 2 | volkswagen rabbit |
| 191 | 22.0 | 6 | 225.0 | 100 | 3233 | 15.4 | 76 | 1 | plymouth valiant |
| 220 | 33.5 | 4 | 85.0 | 70 | 1945 | 16.8 | 77 | 3 | datsun f-10 hatchback |
| 326 | 43.4 | 4 | 90.0 | 48 | 2335 | 23.7 | 80 | 2 | vw dasher (diesel) |
| 324 | 40.8 | 4 | 85.0 | 65 | 2110 | 19.2 | 80 | 3 | datsun 210 |
| 369 | 34.0 | 4 | 112.0 | 88 | 2395 | 18.0 | 82 | 1 | chevrolet cavalier 2-door |
The attribute `shape` stores the size of the table:
df_auto.shape
(397, 9)
info() and describe()#
The method `info()` prints some information about the columns of a dataset:
- column names
- number of present (non-missing) values in each column
- underlying type of each column
df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 397 non-null float64
1 cylinders 397 non-null int64
2 displacement 397 non-null float64
3 horsepower 397 non-null object
4 weight 397 non-null int64
5 acceleration 397 non-null float64
6 year 397 non-null int64
7 origin 397 non-null int64
8 name 397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
The method `describe()` prints statistical information about all numeric columns of a dataset:
df_auto.describe()
|  | mpg | cylinders | displacement | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|
| count | 397.000000 | 397.000000 | 397.000000 | 397.000000 | 397.000000 | 397.000000 | 397.000000 |
| mean | 23.515869 | 5.458438 | 193.532746 | 2970.261965 | 15.555668 | 75.994962 | 1.574307 |
| std | 7.825804 | 1.701577 | 104.379583 | 847.904119 | 2.749995 | 3.690005 | 0.802549 |
| min | 9.000000 | 3.000000 | 68.000000 | 1613.000000 | 8.000000 | 70.000000 | 1.000000 |
| 25% | 17.500000 | 4.000000 | 104.000000 | 2223.000000 | 13.800000 | 73.000000 | 1.000000 |
| 50% | 23.000000 | 4.000000 | 146.000000 | 2800.000000 | 15.500000 | 76.000000 | 1.000000 |
| 75% | 29.000000 | 8.000000 | 262.000000 | 3609.000000 | 17.100000 | 79.000000 | 2.000000 |
| max | 46.600000 | 8.000000 | 455.000000 | 5140.000000 | 24.800000 | 82.000000 | 3.000000 |
pd.to_datetime()#
The function `to_datetime()` converts a column to date/time type. Here we convert the numeric column `year` to `datetime` type:
df_auto['year'] = pd.to_datetime(df_auto['year'], format='%y')
# shows last 5 rows
df_auto.tail()
|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 392 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 1982-01-01 | 1 | ford mustang gl |
| 393 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 1982-01-01 | 2 | vw pickup |
| 394 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 1982-01-01 | 1 | dodge rampage |
| 395 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 1982-01-01 | 1 | ford ranger |
| 396 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 1982-01-01 | 1 | chevy s-10 |
Now the column `year` has the special type `datetime`:
df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 397 non-null float64
1 cylinders 397 non-null int64
2 displacement 397 non-null float64
3 horsepower 397 non-null object
4 weight 397 non-null int64
5 acceleration 397 non-null float64
6 year 397 non-null datetime64[ns]
7 origin 397 non-null int64
8 name 397 non-null object
dtypes: datetime64[ns](1), float64(3), int64(3), object(2)
memory usage: 28.0+ KB
Caution
Applying `pd.to_datetime()` to the Auto dataset does not make much sense and is shown here only for educational purposes.
One-hot encoding#
Before feeding categorical data into machine learning models, we usually need to convert it to a numerical scale. The standard way to do this is one-hot encoding (OHE), also called dummy encoding.
If a feature takes value \(k\) from the finite set \(\{1, \ldots, K\}\), it is encoded by the binary vector \((\delta_1, \ldots, \delta_K)\), where \(\delta_i = 1\) if \(i = k\) and \(\delta_i = 0\) otherwise.
Thus, each categorical variable, which takes \(K\) different values, is converted into \(K\) numeric variables.
For example, consider the feature `weathersit` in the Bikeshare dataset:
df_bike = pd.read_csv("../datasets/ISLP/Bikeshare.csv").drop("Unnamed: 0", axis=1)
# count all unique values in the column `weathersit`
df_bike['weathersit'].value_counts()
weathersit
clear 5645
cloudy/misty 2218
light rain/snow 781
heavy rain/snow 1
Name: count, dtype: int64
Here \(K = 4\), so after OHE this feature will be replaced by a \(4\)-dimensional dummy vector, e.g.:

- `clear` = \((0, 0, 0, 1)\)
- `cloudy/misty` = \((0, 0, 1, 0)\)
- `light rain/snow` = \((0, 1, 0, 0)\)
- `heavy rain/snow` = \((1, 0, 0, 0)\)
OHE can be performed by `sklearn.preprocessing.OneHotEncoder`:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.]])
Note
In fact, it is enough to have \(K-1\) dummy variables, since the value of \(\delta_K\) can be automatically deduced from the values of \(\delta_1, \ldots, \delta_{K-1}\). This is also called one-hot encoding with dropping.
By default, `OneHotEncoder` retains all values of categorical features. To enable dropping, pass `drop='first'`:
ohe = OneHotEncoder(sparse_output=False, drop='first')
ohe.fit_transform(df_bike[['weathersit']])[:15]
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]])
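Alternatively, pandas itself provides `pd.get_dummies()`, which performs the same encoding (`drop_first=True` enables dropping). A sketch on synthetic values, not the actual dataset:

```python
import pandas as pd

# toy column with three categories (synthetic, for illustration)
s = pd.Series(["clear", "cloudy/misty", "clear", "light rain/snow"])
full = pd.get_dummies(s)                      # K dummy columns
dropped = pd.get_dummies(s, drop_first=True)  # K-1 columns, first category dropped
print(full.shape, dropped.shape)
```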
Exercises#
- Do we have any categorical variables in this table?
- What are the advantages and disadvantages of such an encoding of a categorical variable? Does the answer depend on whether it is nominal or ordinal?
- The column `horsepower` of the Auto dataset contains numbers; however, it is missing from the output of the `describe()` method. Why?
- The Bikeshare dataset has several columns related to date/time:
# print first 5 rows
df_bike.head()
|  | season | mnth | day | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | bikers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jan | 1 | 0 | 0 | 6 | 0 | clear | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 1 | Jan | 1 | 1 | 0 | 6 | 0 | clear | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 1 | Jan | 1 | 2 | 0 | 6 | 0 | clear | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 1 | Jan | 1 | 3 | 0 | 6 | 0 | clear | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 1 | Jan | 1 | 4 | 0 | 6 | 0 | clear | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
  Create a new column `datetime` which will store information about both date and time.
- What can be said about the linear dependence of the columns produced by one-hot encoding? Consider two cases: with and without dropping.