Python libraries#

A brief description of selected Python packages and libraries which are useful for data analysis and machine learning.

Tip

In most cases, you can install a Python package by running pip install <package_name> in the terminal.

The main tool

To install Jupyter, run

pip install notebook

or

pip install jupyterlab

Cloud resources#

You can also run Jupyter notebooks in the cloud. Here are several popular solutions:

Data Analysis#

https://miro.medium.com/v2/resize:fit:1400/1*2EHqvZVV4qNjRqrHiBK9-A.png

Pandas#

Pandas is a very popular library for data manipulation and analysis, providing data structures like DataFrames for handling structured data effectively.

import pandas as pd
pd.read_csv("../datasets/ISLP/Publication.csv").drop("Unnamed: 0", axis=1)
     posres  multi  clinend      mech  sampsize     budget  impact       time  status
0         0      0        1       R01     39876   8.016941  44.016  11.203285       1
1         0      0        1       R01     39876   8.016941  23.494  15.178645       1
2         0      0        1       R01      8171   7.612606   8.391  24.410678       1
3         0      0        1  Contract     24335  11.771928  15.402   2.595483       1
4         0      0        1  Contract     33357  76.517537  16.783   8.607803       1
...     ...    ...      ...       ...       ...        ...     ...        ...     ...
239       0      0        0       R01      4105   2.703653   5.355  65.018480       1
240       1      0        0       R44       181   1.117084   0.000  66.989733       0
241       0      0        0       K23       104   0.472321   0.000   9.987680       0
242       0      0        0       R21        69   0.404710   0.000  21.979466       0
243       1      0        0       R01      1699   2.957751   0.000   4.632444       0

244 rows × 9 columns
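
As a minimal sketch of typical DataFrame operations, the snippet below filters rows and computes a grouped mean. The table is a hypothetical toy example: the column names echo the dataset above, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are NOT taken from Publication.csv
df = pd.DataFrame(
    {
        "mech": ["R01", "R01", "Contract", "K23"],
        "sampsize": [100, 300, 250, 50],
        "status": [1, 0, 1, 0],
    }
)

# Keep rows with sampsize >= 100, then average sampsize per funding mechanism
summary = (
    df[df["sampsize"] >= 100]
    .groupby("mech")["sampsize"]
    .mean()
)
print(summary)
```

Boolean masks for filtering and groupby for aggregation cover a large share of everyday pandas work.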

Polars#

Polars is a fast DataFrame library designed for high-performance data analysis. It is often a more efficient alternative to pandas, especially on larger datasets.

https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c144f4a-7b53-4ba3-82d2-4e6b2b9811e6_2103x1962.png
import polars as pl
pl.read_csv("../datasets/ISLP/Publication.csv").drop("")
shape: (244, 9)
posres  multi  clinend  mech        sampsize  budget     impact  time       status
i64     i64    i64      str         i64       f64        f64     f64        i64
0       0      1        "R01"       39876     8.0169405  44.016  11.203285  1
0       0      1        "R01"       39876     8.0169405  23.494  15.178645  1
0       0      1        "R01"       8171      7.612606   8.391   24.410678  1
0       0      1        "Contract"  24335     11.771928  15.402  2.595483   1
0       0      1        "Contract"  33357     76.517537  16.783  8.607803   1
…       …      …        …           …         …          …       …          …
0       0      0        "R01"       4105      2.703653   5.355   65.01848   1
1       0      0        "R44"       181       1.117084   0.0     66.989733  0
0       0      0        "K23"       104       0.472321   0.0     9.98768    0
0       0      0        "R21"       69        0.40471    0.0     21.979466  0
1       0      0        "R01"       1699      2.957751   0.0     4.632444   0

SQL#

SQL (Structured Query Language) is essential for managing and querying relational databases, which are foundational in many data-driven applications. SQL enables efficient data retrieval, manipulation, and storage, making it crucial for everything from small-scale projects to enterprise-level applications.

Top database engines (according to the Stack Overflow 2024 Developer Survey):

../_images/stackoverflow-dev-survey-2024-database.png

In Python, several libraries offer robust support for interacting with SQL databases, allowing developers to seamlessly integrate SQL operations into their code.

  • sqlite3 module provides a built-in interface to interact with SQLite databases, making it ideal for small to medium-sized applications

  • SQLAlchemy allows developers to work with databases using Python objects, abstracting away the complexity of raw SQL while supporting a wide range of databases

  • pandas itself can read from and write to SQL databases through pandas.read_sql and DataFrame.to_sql, enabling seamless integration of SQL operations with DataFrames for data analysis (the separate pandasql package instead runs SQL queries against DataFrames)

  • PyMySQL is a pure-Python MySQL client library that enables Python applications to connect to MySQL and MariaDB databases, execute SQL queries, and manage database connections

  • Psycopg2 is a popular PostgreSQL adapter for Python
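
To make the list above more concrete, here is a minimal sketch that combines the built-in sqlite3 module with pandas. The grants table and its rows are hypothetical and exist only in an in-memory database.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; the table and its rows are made up for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grants (mech TEXT, budget REAL)")
conn.executemany(
    "INSERT INTO grants VALUES (?, ?)",
    [("R01", 8.0), ("R01", 7.6), ("Contract", 11.8)],
)
conn.commit()

# Aggregate in SQL, then load the result into a DataFrame
query = "SELECT mech, COUNT(*) AS n FROM grants GROUP BY mech ORDER BY mech"
counts = pd.read_sql_query(query, conn)
print(counts)
conn.close()
```

The same read_sql_query call also accepts a SQLAlchemy connection, which is the usual route to MySQL or PostgreSQL.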

Data Visualization#

Matplotlib#

Matplotlib is a versatile plotting library that allows for the creation of static, animated, and interactive visualizations in Python.
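
A minimal sketch of a static plot is given below. It assumes the non-interactive Agg backend so that it also runs as a plain script without a display; in a notebook that line can be dropped. The figure is written to an in-memory buffer rather than a file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: no display available, so use a non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Render the figure as PNG bytes in memory
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```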

Seaborn#

Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.

Plotly#

Plotly is an interactive visualization library that supports a wide range of chart types and offers tools for creating web-based visualizations.

Geopandas#

Geopandas extends Pandas to handle spatial data, making it easy to work with geographic datasets and create geographic visualizations.

Altair#

Altair is a declarative statistical visualization library that allows users to build complex visualizations using simple syntax, based on Vega and Vega-Lite.

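import altair as alt
import pandas as pd

source = pd.read_csv("../datasets/ISLP/Auto.csv")

# Interval selection along the x axis: drag on the scatter plot to select a range
brush = alt.selection_interval(encodings=['x'])

# Scatter plot; points outside the selection are drawn in light gray
points = alt.Chart(source).mark_point().encode(
    x='horsepower:Q',
    y='mpg:Q',
    size='acceleration',
    color=alt.condition(brush, 'origin:N', alt.value('lightgray'))
).add_params(brush)

# Bar chart counting cars per origin, restricted to the selected points
bars = alt.Chart(source).mark_bar().encode(
    y='origin:N',
    color='origin:N',
    x='count(origin):Q'
).transform_filter(brush)

# The & operator stacks the two charts vertically
points & bars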

Mathematics#

https://ucarecdn.com/809f51d2-0f2b-490e-8f92-e18193629245/

NumPy#

NumPy offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
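
The following small sketch illustrates array creation, an axis-wise reduction, broadcasting, and a matrix product.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

col_means = a.mean(axis=0)  # mean of each column: [1.5, 2.5, 3.5]
centered = a - col_means    # broadcasting subtracts the row vector from every row
product = a @ a.T           # 2x2 matrix product: [[5, 14], [14, 50]]

print(centered)
print(product)
```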

SciPy#

SciPy builds on NumPy by adding a suite of algorithms for optimization, integration, statistics, and other tasks in scientific computing.
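
As an illustration, the sketch below uses two of these algorithm families, numerical integration and scalar minimization, on simple functions chosen here because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; the exact value is 2
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Minimize a simple quadratic whose minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(area, result.x)
```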

SymPy#

SymPy is a Python library for symbolic mathematics, providing tools for algebraic manipulation, calculus, and other mathematical operations.
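
A short sketch of symbolic differentiation, equation solving, and definite integration follows; the expressions are arbitrary examples.

```python
import sympy as sp

x = sp.symbols("x")
expr = x**2 * sp.sin(x)

derivative = sp.diff(expr, x)              # x**2*cos(x) + 2*x*sin(x)
roots = sp.solve(x**2 - 5 * x + 6, x)      # roots of the quadratic: 2 and 3
integral = sp.integrate(2 * x, (x, 0, 3))  # definite integral, equals 9

print(derivative, roots, integral)
```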

NetworkX#

NetworkX is a library for the creation, manipulation, and study of complex networks and graphs, with tools for analyzing their structure and behavior.
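
The sketch below builds a small undirected graph (a made-up example) and queries a shortest path and node degrees.

```python
import networkx as nx

# Small hypothetical graph: a 4-cycle a-b-c-d plus a pendant node e
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")])

path = nx.shortest_path(G, source="a", target="e")  # ['a', 'd', 'e']
degrees = dict(G.degree())                          # node -> number of neighbours

print(path, degrees)
```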

Classical machine learning#

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSQUepHxnnIWJ_ptY9-wX6Mj-RIssE39vnevw&s

Scikit-Learn#

scikit-learn (imported as sklearn) is a widely used library that implements many classical machine learning algorithms behind a consistent fit/predict interface.
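
A minimal sketch of the usual workflow (split, fit, evaluate) is shown below. The choice of the bundled iris dataset and of logistic regression is an assumption made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out set
print(f"test accuracy: {accuracy:.2f}")
```

Most other scikit-learn estimators can be swapped in for LogisticRegression without changing the rest of the code.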

XGBoost#

XGBoost is an optimized gradient boosting library built around ensembles of decision trees and designed to deliver high performance.

CatBoost#

CatBoost is a gradient boosting library that is particularly strong in handling categorical features and is optimized for performance on structured datasets.

Deep learning#

https://miro.medium.com/v2/resize:fit:1400/0*T6W0rRy8vgFU_K7Z.png

PyTorch#

PyTorch is a deep learning framework that emphasizes flexibility and ease of use, particularly for research, with dynamic computation graphs.
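
The sketch below illustrates what a dynamic computation graph means in practice: the graph is recorded while ordinary Python code runs, so regular control flow can decide which operations are differentiated. The function is an arbitrary example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The branch taken at run time determines the recorded graph
if x.sum() > 0:
    y = (x ** 2).sum()
else:
    y = x.sum()

y.backward()   # backpropagate through the operations that actually ran
print(x.grad)  # gradient of sum(x^2) is 2 * x
```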

TensorFlow#

TensorFlow is an end-to-end open-source platform for machine learning, particularly deep learning, offering a comprehensive ecosystem of tools and libraries.

Keras#

Keras is a high-level neural networks API designed for rapid development and experimentation with deep learning models. It ships with TensorFlow as tf.keras, and Keras 3 can also run on JAX and PyTorch backends.

JAX#

JAX is a library for high-performance numerical computing and machine learning research. It combines automatic differentiation with just-in-time compilation for CPU, GPU, and TPU, supporting fast and flexible computations.
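
As a minimal sketch, the snippet below differentiates a toy loss function with jax.grad and compiles the result with jax.jit. The loss itself is a made-up example.

```python
import jax
import jax.numpy as jnp


def loss(w):
    # Toy quadratic loss with its minimum at w = 1
    return jnp.sum((w - 1.0) ** 2)


# grad builds the gradient function; jit compiles it on first call
grad_loss = jax.jit(jax.grad(loss))

g = grad_loss(jnp.array([0.0, 1.0, 3.0]))  # analytically 2 * (w - 1)
print(g)
```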

Computer vision#

https://www.repeato.app/wp-content/uploads/2024/06/AI-computer-vision-automation.jpg

OpenCV#

OpenCV (Open Source Computer Vision Library) is a comprehensive and widely-used library that provides tools for real-time computer vision, image processing, and machine learning.

Scikit-image#

scikit-image (imported as skimage) is a collection of algorithms for image processing, built on top of NumPy and SciPy, that provides easy-to-use tools for segmentation, filtering, morphology, and other computer vision tasks.

Pillow#

Pillow is a fork of the Python Imaging Library (PIL) that adds image processing capabilities to your Python interpreter, enabling tasks like opening, manipulating, and saving many different image file formats.
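
The following sketch shows a typical manipulate-and-save round trip. To stay self-contained it creates a solid-colour image in memory instead of opening a file, and saves to a buffer instead of disk.

```python
import io

from PIL import Image

# Create a solid red RGB image in memory (stand-in for Image.open on a real file)
image = Image.new("RGB", (200, 100), color=(255, 0, 0))

thumbnail = image.resize((50, 25))  # downscale to 50x25 pixels
gray = thumbnail.convert("L")       # convert to 8-bit grayscale

# Encode as PNG into an in-memory buffer
buffer = io.BytesIO()
gray.save(buffer, format="PNG")
```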

Detectron2#

Detectron2 is a high-performance library developed by Facebook AI Research for object detection, segmentation, and other computer vision tasks, built on top of PyTorch.

Natural language processing (NLP)#

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ9-VgaTI779jH8wtJpiJPcqYlke_4KNLIn2Q&s

NLTK#

NLTK (Natural Language Toolkit) is a comprehensive library for natural language processing, offering tools for text processing, tokenization, parsing, and more.

Gensim#

Gensim is used for topic modeling, word embeddings, and document similarity, with implementations for various NLP algorithms.

FastText#

FastText is a library developed by Facebook for efficient text classification and learning of word representations, with a focus on scalability.

Transformers#

Transformers is a library from Hugging Face that provides implementations of state-of-the-art NLP models, including BERT, GPT, and others.

OpenAI API#

The openai Python package provides bindings for the OpenAI API, giving access to OpenAI’s GPT models and other large language models and enabling a wide range of NLP applications.

LangChain#

LangChain is a framework for building applications that leverage language models, with tools for prompt engineering, memory management, and more.

Reinforcement Learning#

https://www.kdnuggets.com/wp-content/uploads/awan_reinforcement_learning_newbies_1.png

Stable Baselines3#

Stable Baselines3 is a set of implementations for popular reinforcement learning algorithms, built on top of PyTorch, with a focus on usability and reproducibility.

Gymnasium#

Gymnasium (the maintained fork of OpenAI Gym, developed by the Farama Foundation) is a toolkit for developing and comparing reinforcement learning algorithms, offering a variety of environments.

Ray RLlib#

Ray RLlib is a scalable library for reinforcement learning that integrates with the Ray distributed computing framework, offering a wide range of RL algorithms.

OpenAI Baselines#

OpenAI Baselines is a collection of high-quality implementations of various reinforcement learning algorithms, designed for performance and ease of use.