PyData Prague is a community of data scientists, engineers, analysts, and various other developers in the area of scientific computing and data analysis. The term PyData refers to an educational program of NumFOCUS, an American non-profit that supports open source software with governance, financial support, and operations.
The PyData network hosts meetups in hundreds of cities around the world and several conferences each year. The Prague chapter started in 2018 with the aim of spreading the word of open source scientific computing in the Czech Republic. And while the chapter is based in Prague, we operate and collaborate countrywide.
We adhere to PyData’s code of conduct; here is its short version:
Be kind to others. Do not insult or put down others. Behave professionally. Remember that harassment and sexist, racist, or exclusionary jokes and language are not appropriate for PyData.
All communication should be appropriate for a professional audience including people of many different backgrounds. Sexual language and imagery is not appropriate.
PyData is dedicated to providing a harassment-free event experience for everyone, regardless of gender, sexual orientation, gender identity and expression, disability, physical appearance, body size, race, or religion. We do not tolerate harassment of participants in any form.
Thank you for helping make this a welcoming, friendly community for all.
You can find more information at pydata.org/code-of-conduct/
We have hosted several meetups; you can check them out on our Meetup page. We try to host speakers from various backgrounds so that our attendees can find out about all sorts of things possible with the NumFOCUS toolkit (and beyond!).
In this talk, we’ll dive into the essentials of unsupervised anomaly detection for multivariate time series data. As organizations rely increasingly on proactive monitoring, anomaly detection has become vital across domains—from fraud detection to predictive maintenance. This session will cover key challenges in anomaly detection, including data drift, dimensionality, and lack of labeled data. We’ll explore foundational techniques and algorithms, from traditional statistical methods to state-of-the-art machine learning models, offering guidance on selecting tools, setting evaluation metrics, and applying these concepts in real-world scenarios.
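As an illustration of the "traditional statistical methods" end of that spectrum, here is a minimal sketch of unsupervised anomaly scoring with the Mahalanobis distance. It is not taken from the talk: the synthetic sensor data, the injected anomaly, and the 99% quantile cut-off are all assumptions made for this example.

```python
import numpy as np


def mahalanobis_scores(X):
    """Return the squared Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse keeps the sketch usable when channels are collinear.
    cov_inv = np.linalg.pinv(cov)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


rng = np.random.default_rng(0)
# Hypothetical data: 500 time steps of 3 strongly correlated sensor channels.
base = rng.normal(size=(500, 1))
series = np.hstack([base + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)])
series[250] = [4.0, -4.0, 4.0]  # injected anomaly that breaks the correlation

scores = mahalanobis_scores(series)
threshold = np.quantile(scores, 0.99)  # assumed cut-off: flag the top 1% of scores
anomalies = np.flatnonzero(scores > threshold)
```

No labels are needed: a point is scored by how far it sits from the bulk of the data once the correlation between channels is taken into account. A fixed covariance estimate like this one would not cope with data drift, which is one of the challenges the abstract names.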
I developed poliastro, a Python library for interactive Astrodynamics, for almost a decade. Seemingly everything was going well, the project had traction, GitHub stars were going up, I got a regular stream of grant money… but at some point I decided to step down as a maintainer. What happened?
Weed detection techniques are essential in modern precision agriculture, where accurately detecting and identifying species can lead to better crop yields and sustainable resource use. Python has emerged as a powerful tool in this domain, providing versatile libraries and frameworks for data gathering, processing, and model training. By leveraging Python’s capabilities, we can efficiently manage large-scale datasets, preprocess images, and potentially automate the annotation process, streamlining the development of machine learning models for weed detection. The integration of computer vision and machine learning tools, including OpenCV, scikit-image, and TensorFlow, serves as a basis for the creation of models capable of distinguishing between crops and weed species with high precision. As a result, Python-driven weed detection models offer a promising path toward improved crop health, resource conservation, and sustainable farming practices.
Building production-ready LLM systems is challenging, but designing solutions that comply with the EU AI Act introduces layers of complexity far beyond prompt engineering and API integration. This talk takes you through the journey of creating a compliant LLM system for one of the most demanding domains: AI-driven recruitment. We share our journey from a naive PoC to a production-ready solution under the AI Act. Through this case study, we will explore:
Attendees will leave equipped with practical tools and architectural patterns for building LLM systems that meet both engineering and regulatory challenges, supported by a real-world example.
What if we could assign different aspects of a problem to AI models that specialize in those areas, much like delegating tasks to skilled team members? Multi-agent systems (MAS) are enabling this kind of collaboration among computer programs. In this session, we’ll introduce you to LangGraph and Autogen—two leading frameworks that organize AI models into teams of specialists. We’ll explore how each tool structures these teams, manages communication, and measures performance, helping you choose the right solution for various projects.
Is it possible to use git as a database? Yes, we do that. It’s both awesome and terrible. What other weird choices have we made on our way? Come and listen to our story of discovering the most accessible, low-tech approach to building ETL pipelines in a large organization. Along the way, we’ve made unconventional choices—some brilliant, others with unexpected trade-offs but all in the name of keeping things intuitive and manageable for our diverse teams. Join us to learn about our journey, the lessons we’ve learned, and the tools we’ve found to be surprisingly effective.
Have you ever wondered what secrets the internet hides? In just 15 minutes, I’ll take you on a journey through the world of data scraping—from the Darknet to data leaks and beyond. Discover how our specialised team of analysts gathers critical intelligence on hacks and ransomware activities. We’ve employed unconventional methods to extract valuable insights using Python, making bold choices that have led to both brilliant outcomes and unexpected trade-offs. How do we navigate complex sites with CAPTCHA solvers and ensure everything remains intuitive for our analysts? Join me to uncover the challenges and innovations behind our data scraping journey, and learn how we turn chaos into actionable intelligence.
(no talks, just discussion and refreshments)
At Prague Data Platform, we collect, process, and evaluate data from the city to enhance decision-making and improve the lives of its citizens. From waste management to public transport, this talk will provide an overview of the city’s data landscape and showcase specific areas where data is leveraged to the city’s advantage. I will also discuss how we utilize open-source technologies and adhere to open standards to make our work transparent and accessible to clients and citizens.
DuckDB is a powerful tool for data analysts and data engineers. It is an embedded, in-process database that provides connectors and capabilities comparable to complex query engines like Spark or Presto, yet it’s as simple to set up as SQLite. In this talk, I will provide a brief introduction to the technology and how we leverage it as a simple magic wand for easy, in-process data crunching within the Keboola container architecture. I will walk you through our rather unusual use cases
A joint event of PyData Prague and Open-Source Science (OSSci).
Recap available at Open Source Science Medium.
Open-Source Science (OSSci) is a NumFOCUS initiative – launched in July 2022 in partnership with IBM – that aims to accelerate scientific research by improving the ways open source software in science gets done (built, used, funded, sustained, recognized, etc.). OSSci connects scientists, OSS developers and other stakeholders to share best practices, identify common pain points, and explore solutions together. The five OSSci interest groups to date cover domain-specific topics (chemistry/materials, life sciences/healthcare, climate/sustainability) as well as cross-domain topics (reproducibility, map of science), with more to be rolled out in 2024. This talk will provide a brief overview of OSSci’s activities to date, our plans for 2024, and how you can get involved.
In the British academic system, a new movement called “research software engineering” was established in 2012. The goal has been to recognise and promote the vital role of software in research and establish academic career paths for people who develop it. As a Vice-President of the Society of Research Software Engineering (and a research software engineer herself), Evelina will talk about some of the lessons learned over the past decade and the challenges of recognising software and open source contributions as fully-fledged academic outputs.
Since the early 2010s Python has been recognised as a powerful option offering an alternative to then-dominant proprietary platforms like IDL or Matlab, or compiled (and often also proprietary) languages like Fortran and C++, for scientific data processing in the astrophysics research community. The need to implement efficient array computation and visualisation led to significant contributions to evolving modules like Numarray/Numpy and Matplotlib from some institutions and many individuals in the community; yet individual needs also started to set off a proliferation of independent solutions. Astropy was created to foster an ecosystem of interoperable astronomy packages, sharing common coding standards and data APIs, to allow and actively encourage contributors throughout the astronomical community to invest their development work into a widely usable and professionally maintained package. A decade later this has made Python+Astropy now the dominant data-processing platform in astrophysical research. But with this growth the project has also evolved from a relatively informal team effort to a more formally organised and structured system.
While academic research heavily depends on open-source software, the relationship is often one-way. We believe that designing research in close relation to open-source development is beneficial for all parties and present one way of doing that, by turning a research project into a component of the open-source ecosystem.
Academic research often depends on open-source software. Still, researchers do not contribute back that often due to the lack of institutional incentives, time demands, or an imposter syndrome (“my code is too messy”). However, open-source software development doesn’t have to be detached from academic work. The first step is a decision to make the code open. Then the question is, how? From an academic standpoint, packing up the functionality into a new package instead of contributing to existing libraries could lead to additional publications that matter in career progress. However, from an open-source standpoint, such an approach widens the ecosystem’s fragmentation and threatens its sustainability. In this talk, we outline why we chose the path benefiting open source over the academic benefits, how we did it and our vision of academic work closely linked to open-source development. We illustrate this approach in our work on the Urban Grammar AI project, combining aspects of radical openness of the process, making research code available as it is written, and enhancing existing libraries when we need new functionality. It led to significant contributions to the GeoPandas and PySAL ecosystems, a release of one independent package with functionality that didn’t fit elsewhere, and further developments of a canonical Docker container for geographic data science.
What is it like to migrate data from on-prem systems to the cloud in one of the largest banks in the Czech Republic? How to check that nothing has changed along the way? Join us to discover our Python-based solution! Not only will we go through the code, but we’ll also focus on the pitfalls, challenges, and hardships we have encountered along the way. We will thoroughly explain how we verify the migrated data, which is exported as CSV from the source system and stored as Parquet in the AWS Data Lake. Expect a candid and emotional presentation with thoughtfully named commits.
Discover the essence of Pandas Enhancement Proposals, aka PDEPs, and their impact on the landscape of data analysis with Pandas. Join us for an overview of proposed changes, aimed at addressing corner cases and inconsistencies that have emerged over years of development, presented by one of Pandas’ core contributors. From banning (up)casting in “setitem-like” operations to reevaluating the inplace option in methods, delve into the behavioural shifts shaping the Pandas roadmap. Gain insight into the evolving copy and view semantics of operations, offering a glimpse into the anticipated Pandas 3.0 experience through an overview of changes in the new version.
In this talk, we will break down a crucial yet often overlooked skill for ML engineers and data scientists: code vectorization, the cornerstone of modern numerical libraries. The aim is to show when and how to apply this technique to significantly boost performance. We will provide practical insight on implementation, discuss pros and cons, and explore the impact on the codebase. Using primarily Python and NumPy, our code examples will demonstrate the portability of vectorized solutions across libraries and languages.
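To make the idea concrete, here is a minimal sketch (not taken from the talk) that computes pairwise squared distances twice: once with explicit Python loops and once with NumPy broadcasting. The function names and the random test data are assumptions chosen for this example.

```python
import numpy as np


def pairwise_sq_dists_loop(points):
    """Squared Euclidean distances computed with explicit Python loops."""
    n = len(points)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = points[i] - points[j]
            out[i, j] = np.dot(diff, diff)
    return out


def pairwise_sq_dists_vectorized(points):
    """Same result via broadcasting: (n, 1, d) - (1, n, d) -> (n, n, d)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff * diff, axis=-1)


rng = np.random.default_rng(42)
points = rng.normal(size=(50, 3))
loop_result = pairwise_sq_dists_loop(points)
vectorized_result = pairwise_sq_dists_vectorized(points)
```

The vectorized version moves both loops into compiled NumPy code, at the cost of materialising an intermediate array of shape `(n, n, d)`. That memory trade-off is one example of the pros and cons such a rewrite brings.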
Let’s have a look at how we, the Business Intelligence team at Alma Career Czechia, moved JupyterHub and JupyterLab from on-premise infrastructure to AWS. We will explain why we used Amazon SageMaker Studio for just 3 weeks and why we ended up happy with Jupyter running on top of Coder (coder.com) in AWS. The talk takes an infrastructure point of view, with a deeper dive into the pros and cons of on-prem JupyterHub/Lab on HashiCorp Nomad, Amazon SageMaker Studio, and Coder, all considering the requirements of 20 working users in JupyterLab.
In this talk, we will explore the fundamental principles of bioacoustic monitoring, highlighting the challenges of traditional methods and the breakthroughs enabled by AI. We will introduce BirdNET and explain its development, its capabilities and the machine learning algorithms that enable its performance. In our presentation, we will provide a comprehensive analysis of case studies where BirdNET has been successfully deployed, demonstrating its effectiveness in different ecosystems and its pivotal role in bird species identification and population tracking. In addition, we will discuss the wider implications of AI for ecological research and conservation efforts. The integration of BirdNET into citizen science projects and its impact on public involvement in conservation activities will also be highlighted. The presentation will conclude with an outlook on possible future developments in the field of AI-assisted bioacoustic monitoring and the exciting opportunities this presents for our understanding and protection of the natural world.
In 2023, the spotlight shone on AI, marking a transformative shift from a specialized computer science domain to a widely discussed topic across society. ChatGPT and other Large Language Models (LLMs) captured global attention. This talk explores the pivotal datasets driving the training of AI models, investigating their background, creation methods, and more. Additionally, we’ll delve into some controversies surrounding existing datasets.
Working with large language models is much easier than it was a few years ago, and today literally anyone can train their own generative AI model. But when moving from English to Czech and Slovak texts, many of us run into the same problems. Let’s talk about them and discuss solutions: topics range from the “smaller” large language models such as BERT, through multilingual embeddings and the ChatGPT API, to freely available models such as Llama2 and Mistral, and from QLoRA adapters to ethical questions and the mood in society.
💬 This PyData meetup will have a format that is unusual for us: a panel discussion. Given the topic, the official language will be 🇨🇿 Czechoslovak 🇸🇰.
🙌 The following have accepted the invitation to the panel:
In a data-driven world, the ability to tell stories with data is a superpower. Join us on a journey to unlock this power with Streamlit, an easy-to-use and user-friendly tool for building and sharing interactive web applications for data science and machine learning in a fast way. In this talk, we will dive into the numerous advantages of Streamlit as your trusted companion in the art of data storytelling. We will explore its components and learn how they can effortlessly turn your data into interactive and intuitive visual experiences.
But that’s not all! We’ll also embark on a voyage through a few real-world use cases where Streamlit shines, from rapid data exploration to prototyping machine learning models, and even creating custom, user-friendly data dashboards.
By the end of this talk, you’ll be armed with the knowledge and tools to captivate your audience through the magic of data-driven storytelling. Don’t miss your chance to unleash your inner data storyteller with Streamlit!
General-purpose large language models (LLMs) such as GPT-3 and Llama 2 have limitations when it comes to specific tasks. Retrieval Augmented Generation (RAG) and fine-tuning are two powerful techniques for enhancing the reliability of LLMs in organizations.
During this talk, we’ll explore:
This talk is suitable for data analysts, data scientists, and LLM enthusiasts who want to understand the reality of using LLMs in production. Warning! It is not all roses; it gets dirty fast.
In this talk, our project on sign language recognition will be presented. The aim of the project is to create a prototype app that can facilitate communication between hearing-impaired individuals and those who do not know sign language. The challenges faced by hearing-impaired individuals in communicating with the general population will be discussed, and how technology can help bridge this gap will be explained.
Python code relying on the MediaPipe library will be introduced, and its use in extracting key features from sign language gestures, such as hand and finger positions and movements, will be explained. The machine learning techniques used to recognize and classify these gestures will be delved into. The data preprocessing steps, model selection, and training process will be covered, as well as the evaluation metrics used to measure the accuracy of the model. Finally, the prototype will be showcased, and its operation in real time will be demonstrated. The limitations of our current approach and potential future developments in the field will also be discussed.
sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the PyData/NumFOCUS stack.
This short talk gives an introduction to sktime, the time series learning tasks it addresses, and a summary of usage and interfaces. It further introduces common feature extraction, pipelines and composition building blocks, and explains reduction patterns in sktime – using an estimator or object for one learning task to solve another, e.g., a tabular regressor for forecasting.
This is a basic introduction and overview talk; no prior knowledge is required.
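The reduction idea mentioned above (a tabular regressor used for forecasting) can be sketched in plain NumPy. This is deliberately not sktime's API: the helper names, the window length, and the use of ordinary least squares as the tabular regressor are assumptions made for illustration.

```python
import numpy as np


def make_windows(y, window_length):
    """Turn a 1-D series into a tabular (X, target) pair of lagged values."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    target = y[window_length:]
    return X, target


def fit_linear(X, target):
    """Ordinary least squares with an intercept, standing in for any tabular regressor."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def forecast(y, coef, horizon):
    """Recursive forecasting: feed each prediction back in as the newest lag."""
    window_length = len(coef) - 1
    history = list(y[-window_length:])
    predictions = []
    for _ in range(horizon):
        next_value = coef[0] + np.dot(coef[1:], history[-window_length:])
        predictions.append(next_value)
        history.append(next_value)
    return np.array(predictions)


y = np.arange(20, dtype=float)  # toy series with a linear trend
X, target = make_windows(y, window_length=3)
coef = fit_linear(X, target)
predictions = forecast(y, coef, horizon=3)
```

The forecasting task is reduced to regression by sliding a window over the series: each row of `X` holds the last three observations and the target is the value that follows. On this toy trend the recursive forecast simply continues the line.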
I’ll demonstrate how to write a Python module in Rust. What’s more, the module will communicate both ways between Python and Rust using asyncio and tokio. After the talk, you’ll be able to write an app that asynchronously processes data in Rust without holding the GIL. The main advantage of this approach as opposed to microservices is less boilerplate and saving the IPC overhead.
During this session, we will present how we built and open-sourced a Python SDK for our analytics APIs covered by an OpenAPI specification. We will share best practices applied to Python projects (data classes, services, tests, documentation, GitHub Actions, pip, …). In the end, we will conduct a live demo: GitHub APIs -> PostgreSQL -> dbt -> GoodData -> Deepnote (GoodData Python SDK).
During the last eight years of deep neural networks rewriting the landscape of machine translation (MT), we got to the stage where MT is clearly usable and progress hard to measure for language pairs covered well with data. Let us now apply similar technologies to the task of automatic creation of minutes from project meetings (“minuting”). With a similar devotion, we should make automatic minutes usable in a couple of years. And perhaps machine understanding will be necessary at last; to my surprise, it was not needed for MT.
“Everything is related to everything else, but near things are more related than distant things.” That is the first law of geography. But how do we apply it to data science? How do we ensure that our analysis has a spatial dimension and that it can be mapped? How can we combine data based on their location? Are there any spatial patterns? These are the questions you will be able to answer after a gentle introduction to spatial data science in the pandas ecosystem. We will start with a brief explanation of key concepts like geometries and projections to introduce GeoPandas, a package that brings geo to pandas. Then we’ll check what the ecosystem supporting GeoPandas looks like and what it offers, followed by a short excursion to the realm of spatial analytics using the packages from the PySAL (Python Spatial Analysis Library) project. We will be able to expand traditional exploratory data analysis with its spatial counterpart using the esda package, interpolate data between different spatial units using the tobler package, or analyse the structure of cities using momepy. Finally, when the data begin to look too large to work with, we switch to distributed dask-geopandas and wrap up our journey with spatial operations powered by Dask somewhere in the cloud.
When was the last time you encountered a power outage? You would be a bit concerned if you knew how much of today’s critical infrastructure relies on the uninterrupted supply of electricity. How can drones help? This talk explains how we used drone images to automatically detect one of the most common causes of power outages - when trees come dangerously close to the wires.
The presentation will cover the complete workflow:
Julia is a rather new programming language, designed for modern scientific computing. This talk will compare Julia to Python, discuss its current state and production-readiness, emphasise its strengths and weaknesses, describe the interoperability with Python, and speculate whether it is now a good time to rewrite all your Python projects in Julia - all that with some important takeaways from more than a year of running Julia in production at Avast.
JupyterLite is a JupyterLab distribution that runs entirely in the web browser, backed by in-browser language kernels. This presentation will be a functional talk to present JupyterLite with concrete examples and live demos.
Epidemiological modeling helps us to understand the dynamics of disease spread and the effects of various protective measures. Agent-based models provide simulation tools for detailed modeling of individual human behavior. We will present a general network agent model that has been used to study epidemic scenarios in various environments, including a typical Czech county, or a school.
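As a rough illustration of what a network agent model looks like, here is a minimal discrete-time SIR simulation on a contact graph. It is not the model presented in the talk: the ring-shaped network, the three disease states, and all probabilities are assumptions made for this sketch.

```python
import random


def simulate_sir(neighbors, initially_infected, p_infect, p_recover, steps, seed=0):
    """Discrete-time SIR epidemic on a contact network of agents."""
    rng = random.Random(seed)
    state = {agent: "S" for agent in neighbors}
    for agent in initially_infected:
        state[agent] = "I"
    history = []
    for _ in range(steps):
        new_state = dict(state)
        for agent, status in state.items():
            if status == "S":
                # Each infectious contact may independently transmit the disease.
                for other in neighbors[agent]:
                    if state[other] == "I" and rng.random() < p_infect:
                        new_state[agent] = "I"
                        break
            elif status == "I" and rng.random() < p_recover:
                new_state[agent] = "R"
        state = new_state
        history.append({s: sum(v == s for v in state.values()) for s in "SIR"})
    return history


# Hypothetical contact network: a ring of 30 agents, each linked to 2 neighbours per side.
n_agents = 30
neighbors = {
    i: [(i + d) % n_agents for d in (-2, -1, 1, 2)] for i in range(n_agents)
}
history = simulate_sir(neighbors, initially_infected=[0], p_infect=0.3,
                       p_recover=0.1, steps=40)
```

Each entry of `history` counts the susceptible, infected, and recovered agents after one step. In a sketch like this, a protective measure could be approximated by lowering `p_infect` or by removing edges from `neighbors`, and the resulting curves compared.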
No talks.
Jupyter notebook is now one of the most popular tools for data scientists, even though it is fairly difficult to work with in a team setting. In this talk, we will first explore how notebooks work under the hood, then we will discuss how we can build collaboration features to enable real-time editing (like Google Docs), and finally we will address some security challenges inherent to having collaborative data science tools in the cloud.
The Xarray library has in recent years become one of the de-facto standards for working with multi-dimensional datasets in Python. While calling it “a generalization of Pandas into multiple dimensions” gives a reasonable first impression, there is much more to it than that. For instance, the transparent and well structured API offers a concise handle on the depths of NumPy and Dask broadcasting magic. The API and helper functions also enable the construction of convenient and versatile wrappers of e.g. Scipy routines which then become applicable in any domain where data can be represented by Xarray containers. The talk will showcase the basic Pandas-like usage as well as some of the aforementioned advanced features.
No talks.
Automation and artificial intelligence can bring improvements in customer experience, shorten application time and simplify the underwriting process, reduce costs and replace some tedious and repetitive human labor such as damage inspection or claim processing. How does it work in the world of finance, especially insurance and banking, both examples of the most traditional industries, posing strong resistance towards changes?
Possible future development in the industry will be demonstrated on two recent data science and machine learning projects:
The pandas library offers three approaches to user customization: class inheritance, series/data frame accessors, and extension arrays/dtypes. The talk briefly introduces all of them with real-world examples and code, while focusing on a complete example implementation of a physical-unit-aware data column.
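Of the three approaches, the accessor is the lightest to demonstrate. The sketch below registers a hypothetical `units` accessor on `Series` using `pd.api.extensions.register_series_accessor`. The accessor name, the `convert` method, and the conversion table are assumptions for illustration and are not the implementation shown in the talk, which focuses on extension arrays and dtypes.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("units")
class UnitsAccessor:
    """Hypothetical accessor exposing unit conversion as ``series.units``."""

    # Conversion factors to metres; a real implementation would cover far more.
    _TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}

    def __init__(self, series):
        self._series = series

    def convert(self, source, target):
        """Return a new Series converted from ``source`` units to ``target`` units."""
        factor = self._TO_METRES[source] / self._TO_METRES[target]
        return self._series * factor


distances_km = pd.Series([1.5, 0.2, 42.0], name="distance")
distances_m = distances_km.units.convert("km", "m")
```

An accessor adds a namespace without subclassing, but the unit is not stored with the data: the caller has to state the source unit on every call. Carrying the unit inside the column itself is the kind of thing an extension dtype is suited to.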
The automotive industry is one of the most important drivers (pun intended) of the Czech economy. With more than 1.4 million cars produced per year, 800 companies involved, and 160,000 direct employees, automotive is considered the largest industry in the Czech Republic, accounting for more than 9% of GDP.
With a constant flood of buzzwords (IIoT, Industry 4.0, Digital factory 2.0, Autonomous Driving, to name just a few), the aim of this talk is to present a more realistic and data-driven perspective of the industry as well as look at the advantages and challenges of Data Science projects within this environment.
Your data science project may need to produce interactive tools and visualizations to allow end-users to explore data and results. Dash, a project by the team that makes Plotly, solves some of these problems by allowing data scientists to build rich and interactive websites in pure Python, with minimal knowledge of HTML and absolutely no JavaScript.
This talk will give an overview of Dash, how it works and what it can be used for, before outlining some of the common problems that emerge when data scientists are let loose to produce web applications, and web developers have to work with the pydata ecosystem.
Searching for a job is a struggle that LMC is trying to simplify. I would therefore like to give a practical overview of how data science is applied at LMC to achieve that goal. We apply Natural Language Processing (NLP) techniques to better understand our users’ experience and abilities. Moreover, we examine our users’ current preferences using a recommender system in order to recommend the most relevant job options and shorten the path to their next career.
Knowledge of a neighborhood may reveal deficient services to a marketing company; it also enables you to answer your spouse’s inquiry about points of interest on a day trip. To acquire such knowledge, you need a rich source of data with a decent API. OpenStreetMap offers both. This talk explains how to access the data points in OSM and how to query their labels and content using two Python libraries (overpass and overpy) as well as the underlying OSM query language.
This talk will give you a short tour of the Czech Centre for Phenogenomics in the BIOCEV centre in Vestec and a brief overview of current research in mouse-based functional genomics. Karla will also present the various types of data we generate during the research and will show you how Python helps us to overcome some of the everyday challenges we face. Karla Fejfarová is a biostatistician at the Czech Centre for Phenogenomics.
They work in all kinds of companies from startups to corporates. They could be sitting right next to you; the UX designers. But what is their actual job and how could you help each other? Let’s find out with Tomáš Muchka, a senior UX designer at GoodData.
Elasticsearch is an open source distributed datastore. We will see what makes it tick and how to use it from Python, and try to draw some conclusions as to where it might fit in your organization. Honza Král is a principal consulting architect at Elastic and a long-term core committer to Django.
Jakub Urban, a senior Pythonista from Quantlane with rich experience in scientific computing and modelling, will show various possibilities for making your (mostly) numerical calculations in Python fast. He will cover optimization and parallelization using NumPy, Numba, Cython, or Dask. You will learn that Python can be as fast as Fortran with very little effort. In case it cannot, you will see how to seamlessly turn Fortran/C/C++ into a Python module.
Ian Ozsvald is a London-based data scientist with 15+ years of experience and a co-founder of PyData London, one of the largest PyData communities in the world. Ian will introduce PyData, NumFOCUS, and the community behind these initiatives. He’ll then talk about successfully delivering a data science project.
Štěpán Roučka is a contributor to the SymPy project, a computer algebra system widely used within academia and industry, both as a standalone tool and as part of other scientific packages. Štěpán will give a brief overview of SymPy’s functionality and will show you a few examples, which may motivate you to add SymPy to your data science toolbox.