Python libraries#
A brief description of selected Python packages and libraries which are useful for data analysis and machine learning.
Tip
In most cases, installing a Python package only requires running the command pip install <package_name> in the terminal.
Cloud resources#
You can also run Jupyter Notebooks in the cloud. Here are several popular solutions:
Data Analysis#
Pandas#
Pandas is a very popular library for data manipulation and analysis, providing data structures like DataFrames for handling structured data effectively.
import pandas as pd
pd.read_csv("../datasets/ISLP/Publication.csv").drop("Unnamed: 0", axis=1)
| | posres | multi | clinend | mech | sampsize | budget | impact | time | status |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | R01 | 39876 | 8.016941 | 44.016 | 11.203285 | 1 |
1 | 0 | 0 | 1 | R01 | 39876 | 8.016941 | 23.494 | 15.178645 | 1 |
2 | 0 | 0 | 1 | R01 | 8171 | 7.612606 | 8.391 | 24.410678 | 1 |
3 | 0 | 0 | 1 | Contract | 24335 | 11.771928 | 15.402 | 2.595483 | 1 |
4 | 0 | 0 | 1 | Contract | 33357 | 76.517537 | 16.783 | 8.607803 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
239 | 0 | 0 | 0 | R01 | 4105 | 2.703653 | 5.355 | 65.018480 | 1 |
240 | 1 | 0 | 0 | R44 | 181 | 1.117084 | 0.000 | 66.989733 | 0 |
241 | 0 | 0 | 0 | K23 | 104 | 0.472321 | 0.000 | 9.987680 | 0 |
242 | 0 | 0 | 0 | R21 | 69 | 0.404710 | 0.000 | 21.979466 | 0 |
243 | 1 | 0 | 0 | R01 | 1699 | 2.957751 | 0.000 | 4.632444 | 0 |
244 rows × 9 columns
Polars#
Polars is a fast DataFrame library designed for high-performance data analysis, offering a more efficient alternative to pandas.
import polars as pl
pl.read_csv("../datasets/ISLP/Publication.csv").drop("")
posres | multi | clinend | mech | sampsize | budget | impact | time | status |
---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | i64 | f64 | f64 | f64 | i64 |
0 | 0 | 1 | "R01" | 39876 | 8.0169405 | 44.016 | 11.203285 | 1 |
0 | 0 | 1 | "R01" | 39876 | 8.0169405 | 23.494 | 15.178645 | 1 |
0 | 0 | 1 | "R01" | 8171 | 7.612606 | 8.391 | 24.410678 | 1 |
0 | 0 | 1 | "Contract" | 24335 | 11.771928 | 15.402 | 2.595483 | 1 |
0 | 0 | 1 | "Contract" | 33357 | 76.517537 | 16.783 | 8.607803 | 1 |
0 | 0 | 1 | "Contract" | 10355 | 9.809938 | 16.783 | 8.607803 | 1 |
0 | 1 | 0 | "U01" | 1704 | 23.818344 | 5.692 | 40.049281 | 1 |
1 | 0 | 0 | "R01" | 150 | 2.703848 | 3.496 | 27.071869 | 1 |
0 | 0 | 0 | "R01" | 135 | 3.454153 | 9.835 | 36.008214 | 1 |
0 | 1 | 0 | "Contract" | 423 | 11.154085 | 16.783 | 9.626283 | 1 |
0 | 0 | 1 | "Contract" | 4060 | 17.95 | 31.736 | 14.12731 | 1 |
0 | 1 | 1 | "Contract" | 2481 | 29.417184 | 24.831 | 25.034908 | 1 |
… | … | … | … | … | … | … | … | … |
1 | 0 | 0 | "R44" | 1800 | 0.988677 | 3.299 | 35.022587 | 1 |
1 | 0 | 0 | "R21" | 181 | 0.37509 | 0.0 | 4.008214 | 0 |
0 | 0 | 0 | "R01" | 66 | 1.15875 | 0.0 | 5.979466 | 0 |
0 | 0 | 0 | "P50" | 100 | 2.531065 | 0.0 | 11.86037 | 0 |
1 | 0 | 0 | "K23" | 99 | 0.601928 | 0.0 | 13.043121 | 0 |
1 | 0 | 0 | "R01" | 247 | 1.3054805 | 0.0 | 9.691992 | 1 |
1 | 0 | 0 | "R01" | 247 | 1.3054805 | 0.0 | 16.459959 | 1 |
0 | 0 | 0 | "R01" | 4105 | 2.703653 | 5.355 | 65.01848 | 1 |
1 | 0 | 0 | "R44" | 181 | 1.117084 | 0.0 | 66.989733 | 0 |
0 | 0 | 0 | "K23" | 104 | 0.472321 | 0.0 | 9.98768 | 0 |
0 | 0 | 0 | "R21" | 69 | 0.40471 | 0.0 | 21.979466 | 0 |
1 | 0 | 0 | "R01" | 1699 | 2.957751 | 0.0 | 4.632444 | 0 |
SQL#
SQL (Structured Query Language) is essential for managing and querying relational databases, which are foundational in many data-driven applications. SQL enables efficient data retrieval, manipulation, and storage, making it crucial for everything from small-scale projects to enterprise-level applications.
Top database engines (according to the Stack Overflow survey):
In Python, several libraries offer robust support for interacting with SQL databases, allowing developers to seamlessly integrate SQL operations into their code.
The sqlite3 module provides a built-in interface to SQLite databases, making it ideal for small to medium-sized applications.
SQLAlchemy allows developers to work with databases using Python objects, abstracting away much of the complexity of raw SQL while supporting a wide range of databases.
Pandas SQL support (the pandas.io.sql submodule, exposed through functions such as read_sql and DataFrame.to_sql) allows users to read from and write to SQL databases, integrating SQL operations with Pandas DataFrames for data analysis.
PyMySQL is a pure-Python MySQL client library that enables Python applications to connect to MySQL and MariaDB databases, execute SQL queries, and manage database connections.
Psycopg2 is a popular PostgreSQL adapter for Python.
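As a minimal sketch of the built-in sqlite3 module, the snippet below creates an in-memory database, inserts a few rows, and runs an aggregate query. The table name grants and its rows are made up for illustration and do not come from any dataset used on this page.

```python
import sqlite3

# In-memory database: nothing is written to disk.
# The "grants" table and its rows are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grants (mech TEXT, budget REAL)")
conn.executemany(
    "INSERT INTO grants (mech, budget) VALUES (?, ?)",
    [("R01", 8.0), ("R01", 7.6), ("Contract", 11.8)],
)
conn.commit()

# Aggregate with plain SQL; the threshold is passed as a bound parameter
# rather than being formatted into the query string.
rows = conn.execute(
    "SELECT mech, COUNT(*), SUM(budget) FROM grants "
    "WHERE budget > ? GROUP BY mech ORDER BY mech",
    (5.0,),
).fetchall()
print(rows)

conn.close()
```

Bound parameters (the ? placeholders) keep values separate from the SQL text, which is the usual way to avoid SQL injection with this module.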
Data Visualization#
Matplotlib#
Matplotlib is a versatile plotting library that allows for the creation of static, animated, and interactive visualizations in Python.
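A minimal static plot might look like the sketch below. It assumes the non-interactive Agg backend so that it also runs without a display, and it writes the figure to an in-memory buffer instead of a file; the data points are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen here so no display is needed
import matplotlib.pyplot as plt

# Arbitrary illustration data
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG bytes held in memory
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a Jupyter Notebook the backend line and the buffer are usually unnecessary, since the figure is displayed inline.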
Seaborn#
Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.
Plotly#
Plotly is an interactive visualization library that supports a wide range of chart types and offers tools for creating web-based visualizations.
Geopandas#
Geopandas extends Pandas to handle spatial data, making it easy to work with geographic datasets and create geographic visualizations.
Altair#
Altair is a declarative statistical visualization library that allows users to build complex visualizations using simple syntax, based on Vega and Vega-Lite.
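import altair as alt
import pandas as pd

source = pd.read_csv("../datasets/ISLP/Auto.csv")

# Interval selection along the x-axis, defined by dragging on the scatter plot
brush = alt.selection_interval(encodings=['x'])

# Scatter plot: points outside the selection are drawn in light gray
points = alt.Chart(source).mark_point().encode(
    x='horsepower:Q',
    y='mpg:Q',
    size='acceleration',
    color=alt.condition(brush, 'origin:N', alt.value('lightgray'))
).add_params(brush)

# Bar chart of counts per origin, restricted to the selected points
bars = alt.Chart(source).mark_bar().encode(
    y='origin:N',
    color='origin:N',
    x='count(origin):Q'
).transform_filter(brush)

# Stack the two charts vertically
points & bars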
Mathematics#
NumPy#
NumPy offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
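To illustrate what operating on arrays efficiently means in practice, here is a small sketch with arbitrary values: it centers the columns of a matrix using broadcasting and computes a matrix product, without any explicit Python loops.

```python
import numpy as np

# A 2x3 array with arbitrary values: [[0, 1, 2], [3, 4, 5]]
a = np.arange(6).reshape(2, 3)

# Column means, shape (3,)
col_means = a.mean(axis=0)

# Broadcasting: the (3,) vector is subtracted from every row of the (2, 3) array
centered = a - col_means

# Matrix product of the transposed and the centered array, shape (3, 3)
gram = centered.T @ centered
print(gram)
```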
SciPy#
SciPy builds on NumPy by adding a suite of algorithms for optimization, integration, statistics, and other tasks in scientific computing.
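As a short sketch of two of the tasks mentioned above, the snippet below minimizes a one-dimensional function and integrates another numerically. Both functions are toy examples chosen because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Optimization: the minimum of (x - 2)^2 + 1 lies at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Integration: the integral of sin(x) over [0, pi] is exactly 2
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

print(result.x, area)
```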
SymPy#
SymPy is a Python library for symbolic mathematics, providing tools for algebraic manipulation, calculus, and other mathematical operations.
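A brief sketch of symbolic calculus and equation solving follows; the expressions are arbitrary examples. Unlike NumPy, the results are exact symbolic expressions rather than floating-point numbers.

```python
import sympy as sp

x = sp.symbols("x")

# Symbolic differentiation via the product rule
expr = x**2 * sp.sin(x)
derivative = sp.diff(expr, x)

# Symbolic integration of the derivative recovers an antiderivative
antiderivative = sp.integrate(derivative, x)

# Exact roots of a quadratic equation
roots = sp.solve(x**2 - 5*x + 6, x)

print(derivative, antiderivative, roots)
```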
NetworkX#
NetworkX is a library for the creation, manipulation, and study of complex networks and graphs, with tools for analyzing their structure and behavior.
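The following sketch builds a small weighted graph and queries its structure. The node names and edge weights are invented for illustration.

```python
import networkx as nx

# Undirected graph with hypothetical nodes and edge weights
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1.0),
    ("B", "C", 2.0),
    ("A", "C", 5.0),
    ("C", "D", 1.0),
])

# Weighted shortest path from A to D and its total weight
path = nx.shortest_path(G, "A", "D", weight="weight")
length = nx.shortest_path_length(G, "A", "D", weight="weight")

# Number of neighbors of each node
degrees = dict(G.degree())

print(path, length, degrees)
```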
Classical machine learning#
Scikit-Learn#
Scikit-learn (imported as sklearn) is a widely used library that implements many classical machine learning algorithms for classification, regression, clustering, and dimensionality reduction.
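A typical workflow, sketched here with the iris dataset bundled with the library, is to split the data, fit an estimator, and score it on held-out samples. The choice of logistic regression and of the split parameters is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small built-in dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Estimators share the same fit / predict / score interface
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because estimators share this interface, swapping in another classifier usually means changing only the line that constructs the model.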
XGBoost#
XGBoost is an optimized gradient boosting library designed to deliver high performance, particularly for decision tree algorithms.
CatBoost#
CatBoost is a gradient boosting library that is particularly strong in handling categorical features and is optimized for performance on structured datasets.
Deep learning#
PyTorch#
PyTorch is a deep learning framework that emphasizes flexibility and ease of use, particularly for research, with dynamic computation graphs.
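The idea of dynamic computation graphs can be sketched as follows: the graph is recorded while the Python code runs, so ordinary control flow can decide which operations are tracked. The tensor values are arbitrary.

```python
import torch

# Tensor whose operations are tracked for automatic differentiation
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The graph is built as the code executes, so a plain Python
# conditional can choose which computation gets recorded
if x.sum() > 0:
    y = (x ** 2).sum()
else:
    y = x.sum()

# Backpropagate: here dy/dx = 2 * x
y.backward()
grad = x.grad
print(grad)
```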
TensorFlow#
TensorFlow is an end-to-end open-source platform for machine learning, particularly deep learning, offering a comprehensive ecosystem of tools and libraries.
Keras#
Keras is a high-level neural networks API that runs on top of TensorFlow (and, since Keras 3, also on JAX and PyTorch), designed for rapid development and experimentation with deep learning models.
JAX#
JAX is a library for high-performance numerical computing and machine learning research, combining automatic differentiation with just-in-time compilation for GPU and TPU acceleration.
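A minimal sketch of how JAX composes function transformations: jax.grad turns a scalar-valued function into its gradient function, and jax.jit compiles the result. The loss function here is a toy example.

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Toy quadratic loss with its minimum at w = 1
    return jnp.sum((w - 1.0) ** 2)

# Transformations compose: differentiate first, then compile
grad_loss = jax.jit(jax.grad(loss))

w = jnp.array([0.0, 2.0, 4.0])
g = grad_loss(w)  # analytically 2 * (w - 1)
print(g)
```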
Computer vision#
OpenCV#
OpenCV (Open Source Computer Vision Library) is a comprehensive and widely used library that provides tools for real-time computer vision, image processing, and machine learning.
Scikit-image#
Scikit-image (imported as skimage) is a collection of algorithms for image processing, built on top of SciPy, that provides easy-to-use tools for segmentation, filtering, morphology, and other computer vision tasks.
Pillow#
Pillow is a fork of the Python Imaging Library (PIL) that adds image processing capabilities to your Python interpreter, enabling tasks like opening, manipulating, and saving many different image file formats.
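The sketch below shows the open, manipulate, and save cycle on a synthetic image, so that no image file is required. The sizes and colors are arbitrary, and the result is saved to an in-memory buffer rather than to disk.

```python
import io

from PIL import Image

# Synthetic 64x32 red image, created in memory instead of loaded from a file
image = Image.new("RGB", (64, 32), color=(255, 0, 0))

# Manipulate: downscale, then convert to 8-bit grayscale ("L" mode)
thumb = image.resize((32, 16)).convert("L")

# Save as PNG into a buffer and read it back
buffer = io.BytesIO()
thumb.save(buffer, format="PNG")
reloaded = Image.open(io.BytesIO(buffer.getvalue()))
print(reloaded.size, reloaded.mode, reloaded.format)
```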
Detectron2#
Detectron2 is a high-performance library developed by Facebook AI Research for object detection, segmentation, and other computer vision tasks, built on top of PyTorch.
Natural language processing (NLP)#
NLTK#
NLTK (Natural Language Toolkit) is a comprehensive library for natural language processing, offering tools for text processing, tokenization, parsing, and more.
Gensim#
Gensim is used for topic modeling, word embeddings, and document similarity, with implementations for various NLP algorithms.
FastText#
FastText is a library developed by Facebook for efficient text classification and learning of word representations, with a focus on scalability.
Transformers#
Transformers is a library from Hugging Face that provides implementations of state-of-the-art NLP models, including BERT, GPT, and others.
OpenAI API#
The openai package provides Python bindings for the OpenAI API, giving access to OpenAI’s GPT models and other large language models and enabling a wide range of NLP applications.
LangChain#
LangChain is a framework for building applications that leverage language models, with tools for prompt engineering, memory management, and more.
Reinforcement Learning#
Stable Baselines3#
Stable Baselines3 is a set of implementations for popular reinforcement learning algorithms, built on top of PyTorch, with a focus on usability and reproducibility.
Gymnasium#
Gymnasium (a maintained fork of OpenAI Gym) is a toolkit for developing and comparing reinforcement learning algorithms, offering a variety of environments.
Ray RLlib#
Ray RLlib is a scalable library for reinforcement learning that integrates with the Ray distributed computing framework, offering a wide range of RL algorithms.
OpenAI Baselines#
OpenAI Baselines is a collection of high-quality implementations of various reinforcement learning algorithms, designed for performance and ease of use.