pydata.cz

PyData Prague

Contact:

info@pydata.cz

Social networks:

Meetup - Find and register for our meetups here!
LinkedIn - We are here…
Bluesky - … and here…
Facebook - …and here!

PyData Prague is a community of data scientists, engineers, analysts, and various other developers in the area of scientific computing and data analysis. The term PyData refers to an educational program of NumFOCUS, an American non-profit that supports open source software through governance, financial support, and operations.

The PyData network hosts meetups in hundreds of cities around the world and several conferences each year. The Prague chapter started in 2018 with the aim of spreading the word of open source scientific computing in the Czech Republic. And while the chapter is based in Prague, we operate and collaborate countrywide.

Code of Conduct

We adhere to PyData’s code of conduct. Here is its short version:

Be kind to others. Do not insult or put down others. Behave professionally. Remember that harassment and sexist, racist, or exclusionary jokes and language are not appropriate for PyData.

All communication should be appropriate for a professional audience including people of many different backgrounds. Sexual language and imagery is not appropriate.

PyData is dedicated to providing a harassment-free event experience for everyone, regardless of gender, sexual orientation, gender identity and expression, disability, physical appearance, body size, race, or religion. We do not tolerate harassment of participants in any form.

Thank you for helping make this a welcoming, friendly community for all.

You can find more information at pydata.org/code-of-conduct/

We have hosted several meetups; you can check them out on our Meetup page. We try to host speakers from various backgrounds, so that our attendees get to find out about all sorts of things possible with the NumFOCUS toolkit (and beyond!).

Past Meetups

PyData Prague #35 - Probably unreliable vulnerabilities (26.5.2026 at Aisle)

Stanislav Fort - What a Single-File LLM Security Analyzer Taught Us?

High-quality AI security research can uncover real vulnerabilities in critical infrastructure. AISLE is one example of this higher-signal approach, with validated findings in projects like OpenSSL and curl. At the same time, low-quality AI-generated reports are flooding open-source maintainers with false positives.

How hard is it to find a security bug? We will explore that question through nano-analyzer, a deliberately simple open-source security scanner. For many vulnerability classes, the surprising core is not a complex platform, but a well-aimed LLM call wrapped in the right workflow.

This simplicity has limits. The approach may miss obvious issues, hallucinate risky findings, or produce inconsistent results across runs. That is why validation, triage, benchmarking, and human judgment matter, and why the real challenge is building reliable processes around unreliable primitives.

Marcela Brichtová Piptová - Getting reliable text when PDFs lie and OCR fails (video)

LLMs need text as an input. So before a model can reason about a document, we have to read the text, a step often treated as the “easy part” or a solved problem. But is it?

In this talk, we will explore the hidden complexities of text extraction. This is especially critical for models like Rossum’s T-LLM, an encoder-only architecture which heavily relies on high-quality input. You will learn why transactional documents are sometimes surprisingly hard for OCR, why you can’t always just copy-paste text from a PDF, and why text extraction is still a topic for Rossum researchers (and our customer support team).

PyData Prague #34 - Learning from Distillation (10.3.2026 at Similarweb)

Alena Pavlova - Why Your Models Fail: Learning from Regime Shifts in Real Data

When building predictive models, we often assume that the data-generating process is stable over time. In practice, this assumption is frequently violated and ignoring it can severely degrade model performance.

In this talk, we use stock market price data to demonstrate how overlooking market regimes leads to misleading results and fragile trading strategies. We start by implementing a simple strategy under the assumption of a single, stationary regime and show how it fails in changing market conditions.

We then introduce a regime-aware approach to infer latent market states. By applying the same strategy only within appropriate regimes, we demonstrate how performance, risk characteristics, and interpretability can change dramatically.

The talk focuses on practical Python implementation, intuitive explanations of regime modeling, and concrete lessons that extend beyond finance to any non-stationary time-series problem. Attendees will learn how to detect regime shifts, incorporate them into modeling workflows, and avoid common pitfalls when working with evolving data.

Gabriela Kadlecová - Training Small Language Models with Knowledge Distillation (video)

Large language models (LLMs) have proven useful across a wide range of real-world applications, from code generation and document analysis to intelligent automation and data extraction. However, their practical adoption comes with some tradeoffs: inference costs can be prohibitive at scale, and sending sensitive data to external cloud APIs raises privacy concerns. Sometimes, a general-purpose model is more than what’s needed - if the task is quite specific, a smaller, specialized model could do the job just as well.

Small Language Models (SLMs) are a great alternative - running entirely on local hardware, they keep data private, respond faster, and operate at a fraction of the cost. The key challenge is making them good enough for the task at hand, and this is where knowledge distillation comes in. By having a large “teacher” model automatically generate and refine training examples, we can create a specialized “student” model tailored to a specific task, starting from as few as a handful of seed examples.

In this talk, we walk through the full distillation pipeline: from defining a task and preparing seed data, through synthetic data generation, to fine-tuning an SLM ready for local deployment. We also present benchmark results comparing base and fine-tuned models, and a live demo showing how capable a well-distilled small model can be.

PyData Prague #33 - The Root of All Eval (11.2.2026 at Pure Storage)

Šimon Podhajský - Evals, Benchmarks, and Guardrails: A Pythonista’s Guide to Not Mixing Them Up

“I’ll just write pytest tests for my LLM”—but should you? This talk untangles benchmarks, evals, and guardrails: three concepts that sound similar but map to different Python patterns. Learn why pytest CAN work for evals (with the right mindset), why guardrails aren’t tests at all, and a grounded theory approach to defining what “good” actually means for your task.

Ondřej Hlaváč - Orchestration Beyond the Schedule: Real-Time Integrations with Prefect

As Python data workflows grow in complexity, relying on simple scripts and cron jobs often leads to “silent failures” and a lack of visibility. Enter Prefect—a modern orchestration framework that empowers developers to build, observe, and manage robust pipelines using standard Python code. While Prefect is rapidly gaining traction for scheduled batch processing, we took a different path: using it to power real-time event integrations for Master Data Management.

In this talk, I will introduce what makes Prefect unique compared to legacy tools and demonstrate how we adapted it to handle immediate, event-driven flows. We will explore how its built-in resilience capabilities, such as automatic retries, state management, and detailed observability, can be repurposed to make real-time integrations as reliable as nightly ETL jobs.

PyData Prague #32 - Scrapeyard Forge (26.11.2025 at Apify)

Vladimír Dušek - Dealing with today’s web scraping challenges in Python (video)

Web data powers today’s AI revolution, but accessing it is becoming increasingly complex as websites grow more sophisticated. Modern web scraping and automation face challenges such as IP and geographical blocking, JavaScript-based content rendering, device and browser fingerprinting, CAPTCHAs, anti-scraping HTTP headers, and more.

In this talk, you’ll learn how to beat these challenges using Crawlee—a modern open-source Python library for web scraping and automation that we built from scratch at Apify.

Jan Kislinger - Forged in Rust, Spoken in Python (slides, video)

Python has long been the go-to language for data science and rapid experimentation, but when our models and algorithms start to hit performance limits, we naturally look toward something closer to the metal. In recent years, Rust has become a powerful partner: a language that offers high speed, strong safety guarantees, and a surprisingly pleasant developer experience. With tools like pyo3 and maturin, we can implement performance-critical components in Rust while keeping the flexible, expressive Python interface we love. This talk explores how the evolving Python–Rust ecosystem is gently reshaping the balance between productivity and performance. What new capabilities do we unlock by going lower level, and what, if anything, do we leave behind?

PyData Prague #31 - PyData Prague meets TechMeetup Ostrava (15.10.2025 at Impact Hub Ostrava)

Darius Kryszczuk - Kill GIL: How Python 3.14 Changes Concurrent Programming

Python 3.14 introduces the long-awaited ability to disable the Global Interpreter Lock (GIL). Although this feature is still experimental, it has the potential to fundamentally reshape concurrent programming in CPython. This talk will explore the implications of GIL removal, focusing on how it enhances parallelism, the performance improvements for multi-threaded applications, and the challenges developers may encounter as they adapt to this new paradigm.

Martina Zátopková - How Do We Read the Genome?

The genome is the complete set of genetic information stored in the DNA molecule, which serves as a template for the development and functioning of an organism. In this lecture, we will explore the principles of DNA sequencing from classical methods to modern platforms as well as the bioinformatics approaches used to transform raw data into meaningful insights that can be applied, for example, in medicine or evolutionary biology.

Alena Martinková & Emanuel Dopater - How Rankacy Analyzes Millions of Games - Python at the Heart of Esports Data

Python is at the core of our analytics platform, which processes over 8,000 game records daily, each approximately 500 MB in size. Over the past two years, we have accumulated more than 200 TB of data, equivalent to 1,600 years of game time from over 7 million players—and our goal is to increase this user count tenfold.

This talk will cover how we transitioned from Go and C++ parsers connected via PyBind to data frames in Python, how our analyses evolved from Pandas to Polars, and why we migrated our backend from Django to FastAPI. Finally, we will share our real-world experience with performance optimization, leveraging RabbitMQ, Redis, and process monitoring in an environment where Python bridges the worlds of game data and AI analysis.

Lumír Balhar - Random, but not really

How computers generate numbers for different purposes. Ever wondered how your computer decides what’s “random”? Let’s peek behind the curtain and see why getting it wrong can be disastrous.
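As a rough illustration of the "random, but not really" idea (not taken from the talk), the sketch below implements a linear congruential generator, one of the oldest pseudorandom schemes. The function name `lcg` is hypothetical, and the default constants are the commonly cited Numerical Recipes parameters, chosen here only as an assumption for the example.

```python
from itertools import islice


def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield an endless stream of pseudorandom integers in [0, m)."""
    state = seed % m
    while True:
        # Each value is a fixed, deterministic function of the previous one.
        state = (a * state + c) % m
        yield state


# Two generators started from the same seed produce identical output.
first_run = list(islice(lcg(42), 5))
second_run = list(islice(lcg(42), 5))
print(first_run == second_run)  # True: same seed, same "random" numbers
```

Because the whole sequence follows from the seed, a generator like this is fine for simulations but unsuitable for anything security-related, where Python's `secrets` module is the usual choice.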

PyData Prague #30 - Pro: To Type Asynchronously (24.9.2025 at Qminers)

Daria Bilan - Prototype: A framework that lets analysts focus on ideas, not code

We’ve streamlined the classic loop — generate data, train a model, report results — into a single, user-friendly toolkit that handles the unglamorous parts of experiments: parallelism, caching, storage, and report generation. We’ll focus on the analyst experience and how small UX choices compound into faster iteration.

Jakub Urban - Async to Distributed: Concurrency Patterns for Data Processing (video)

How can you make your Python data workflows scale and run faster? We will explore built‑in concurrent.futures and asyncio patterns, and then show if and how to scale using frameworks like Dask or Ray. We will focus on practical recipes for CPU- (or GPU-) and memory-intensive tasks, typical for data processing.
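For readers unfamiliar with the two built-in approaches named above, here is a minimal sketch (not from the talk) that runs the same toy computation through a `concurrent.futures` thread pool and through `asyncio`. The function names `square`, `run_in_threads`, `fetch`, and `run_async` are hypothetical, and `asyncio.sleep` merely stands in for real I/O.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A stand-in for a blocking task such as reading a file."""
    return x * x


def run_in_threads(values):
    # pool.map preserves input order and blocks until all tasks finish.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, values))


async def fetch(x):
    """A stand-in for a non-blocking I/O call."""
    await asyncio.sleep(0.01)
    return x * x


async def run_async(values):
    # gather schedules all coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fetch(v) for v in values))


thread_results = run_in_threads(range(5))
async_results = asyncio.run(run_async(range(5)))
print(thread_results, async_results)
```

Threads and coroutines mainly help with I/O-bound work; for CPU-bound tasks, swapping in `ProcessPoolExecutor` (with a `__main__` guard) is the usual standard-library option before reaching for a distributed framework.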

PyData Prague #29 - Summer Edition (28.7.2025 at Elpíčko)

No talks.

PyData Prague #28 - Terminal Request (9.7.2025 at Prusa Research)

Jan Pipek - Data wrangling in a modern terminal (slides, video)

Once we constrain ourselves to a rectangle of fixed-width characters (preferably white on a black background), we start to see the world a bit differently. If we want to thoroughly investigate it (a.k.a. perform data analysis), we have to be equipped with appropriate tools - be it techniques, libraries or standalone console-based applications. Let’s see what the terminal has to offer when reading, manipulating, presenting and even plotting numerical data. We might even finish with a live dashboard your audience will love (or perhaps will not).

Jakub Zikl - Right-Sized Scaling: Python APIs at Billions of Requests Without the Complexity (video)

At Printables.com, we handle billions of requests every month using a fairly simple, Python-based API stack that scales reliably without unnecessary complexity. In this talk, I’ll share how embracing pragmatism over hype helped us avoid overengineering—proving that microservices and complex architectures aren’t always the answer for every challenge. We’ll explore key design choices, real-world bottlenecks, and practical lessons from our journey to build a maintainable, cost-effective system that delivers at scale. Whether you’re growing a startup or managing a mature platform, you’ll gain actionable insights for scaling Python APIs with confidence.

PyData Prague #27 - LLMs Anonymous (10.4.2025 at Miton)

Katharine Jarmul - Anonymization: Why is it so hard? (video)

In this talk, we’ll look at the problem of anonymizing data and pick apart several common mistakes people make when attempting to remove or modify sensitive data in order to anonymize it. We’ll review basic approaches like pseudonymization and redaction, and then look at more advanced approaches like k-anonymization/aggregation and differential privacy. Expect to think a bit like a hacker and a data scientist to see if you can imagine ways to defeat these approaches, and explore how intertwined information theory is with privacy work.
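To make the k-anonymity idea mentioned above concrete, here is a minimal sketch (not from the talk): a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The function names and the sample records are made up for illustration.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k


# Hypothetical, already generalized records (made-up data).
records = [
    {"zip": "110**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "110**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "120**", "age": "40-49", "diagnosis": "flu"},
]

print(is_k_anonymous(records, ["zip", "age"], k=2))  # False: one group has a single record
```

Even a dataset that passes this check can leak information, for example when everyone in a group shares the same sensitive value, which is one reason stronger notions such as differential privacy exist.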

Štěpán Procházka - LLMs, the do-it-yourself edition (video)

How Rossum researched, trained and deployed their very own T-LLM to process millions of documents a month. #huggingface #pytorch-lightning #nvidia-triton #pgvector #vllm

PyData Prague #26 - Table Diffusion (20.3.2025 at MSD)

Zdeněk Morávek - Defect detection in X-ray images of solid tablets. Data augmentation with Stable diffusion

Data augmentation is a standard method for improving the training of supervised machine learning systems. It transforms existing data through rotations, clipping, scaling, etc. The method has proved useful, but some traits of the original data still limit the efficiency and scalability of the augmentation.

Generative AI makes it possible to create synthetic data from an original dataset. The synthesis is virtually limitless and the synthetic data does not share any traits with the original data. This makes it a powerful extension for data augmentation, especially if the original dataset is limited. There is still a question of whether the synthetic data represents the original dataset well.

We applied the Stable Diffusion generative algorithm to generate synthetic cracks in solid state tablets. The dataset is limited in size and the cracks are low-contrast objects with variable properties. We developed a Mask R-CNN classifier and trained it on the available dataset as a baseline model.

We selected suitable images for training the Stable Diffusion generator and created a synthetic dataset. We investigated statistics of pixel properties of the real and synthetic datasets, showing that the main features are conserved though details differ. We used the synthetic data to train an alternative model and compared its performance to the baseline. We demonstrated that we can achieve an improvement in accuracy, but on the other hand we observed a higher false positive ratio and reduced applicability to qualitatively different datasets. We discuss the reasons behind these observations and how to improve on them.

Ivona Krchová - AI-Generated Tabular Synthetic Data: What It Is, How It’s Created, and Its Applications (video)

Synthetic data has become an important tool in data science, offering a way to generate realistic data while preserving privacy. In this talk, we’ll explore AI-generated tabular synthetic data—what it is, how it’s created, and how it can be used effectively in various contexts.

I’ll begin with a short overview of synthetic data, explaining its key concept and how it differs from traditional data anonymization techniques. Next, I’ll briefly describe the algorithm developed by MOSTLY AI for generating tabular synthetic data. Finally, we’ll explore key use cases and discuss how synthetic data can be used to enhance datasets, address missing values, or mitigate bias in model outcomes.

PyData Prague #25 - Moon Time Anomaly (4.2.2025 at Similarweb)

Lucie Blechová - Demystifying Anomaly Detection: A Practical Guide for Time Series Data (video)

In this talk, we’ll dive into the essentials of unsupervised anomaly detection for multivariate time series data. As organizations rely increasingly on proactive monitoring, anomaly detection has become vital across domains—from fraud detection to predictive maintenance. This session will cover key challenges in anomaly detection, including data drift, dimensionality, and lack of labeled data. We’ll explore foundational techniques and algorithms, from traditional statistical methods to state-of-the-art machine learning models, offering guidance on selecting tools, setting evaluation metrics, and applying these concepts in real-world scenarios.

Juan Luis Cano Rodríguez - To the Moon and back: Lessons learned from archiving my dream open source project (video)

I developed poliastro, a Python library for interactive Astrodynamics, for almost a decade. Seemingly everything was going well, the project had traction, GitHub stars were going up, I got a regular stream of grant money… but at some point I decided to step down as a maintainer. What happened?

PyData Prague #24 - The Large, the Weed and the Compliant (3.12.2024 at Creative Dock)

Adam Hruška - Processing problematic plants with Python (video)

Weed detection techniques are essential in modern precision agriculture, where accurately detecting and identifying species can lead to effective crop yields and sustainable resource use. Python has emerged as a powerful tool in this domain, providing versatile libraries and frameworks for data gathering, processing, and model training. By leveraging Python’s capabilities, we can efficiently manage large-scale datasets, preprocess images, and potentially automate the annotation process, streamlining the development of machine learning models for weed detection. The integration of computer vision and machine learning tools, including OpenCV, scikit-image, and TensorFlow, serves as a basis for creation of models capable of distinguishing between crops and weed species with high precision. As a result, Python-driven weed detection models offer a promising path toward improved crop health, resource conservation, and sustainable farming practices.

Soheyla Mirshahi, Jan Kryštůfek - Beyond Accuracy: Engineering EU-Compliant LLM Systems (video)

Building production-ready LLM systems is challenging, but designing solutions that comply with the EU AI Act introduces layers of complexity far beyond prompt engineering and API integration. This talk takes you through the journey of creating a compliant LLM system for one of the most demanding domains: AI-driven recruitment. We share our journey from a naive PoC to a production-ready solution under the AI Act. Through this case study, we will explore:

Compliance-first approach: Designing LLM systems to meet security, legal and ethical standards from the ground up
Taming LLMs in Python: Using prompts, function calls, and a scoring model to get results in a structured and auditable form
Evaluation: Testing on synthetic data, human evaluation and user feedback
Numbers: Statistics from production and perceived benefits

Attendees will leave equipped with practical tools and architectural patterns for building LLM systems that meet both engineering and regulatory challenges, supported by a real-world example.

PyData Prague #23 - Low Agent Intelligence Threat (7.11.2024 at Rapid7)

Šimon Podhajský - Multi-Agent Frameworks: Teaming Up Specialized AI Models to Tackle Complex Tasks (video)

What if we could assign different aspects of a problem to AI models that specialize in those areas, much like delegating tasks to skilled team members? Multi-agent systems (MAS) are enabling this kind of collaboration among computer programs. In this session, we’ll introduce you to LangGraph and Autogen—two leading frameworks that organize AI models into teams of specialists. We’ll explore how each tool structures these teams, manages communication, and measures performance, helping you choose the right solution for various projects.

Martin Votruba - Low-Tech ETL (video)

Is it possible to use git as a database? Yes, we do that. It’s both awesome and terrible. What other weird choices have we made on our way? Come and listen to our story of discovering the most accessible, low-tech approach to building ETL pipelines in a large organization. Along the way, we’ve made unconventional choices—some brilliant, others with unexpected trade-offs but all in the name of keeping things intuitive and manageable for our diverse teams. Join us to learn about our journey, the lessons we’ve learned, and the tools we’ve found to be surprisingly effective.

Dominique Alexander Piatti - Peek into Threat Intelligence (video)

Have you ever wondered what secrets the internet hides? In just 15 minutes, I’ll take you on a journey through the world of data scraping—from the Darknet to data leaks and beyond. Discover how our specialised team of analysts gathers critical intelligence on hacks and ransomware activities. We’ve employed unconventional methods to extract valuable insights using Python, making bold choices that have led to both brilliant outcomes and unexpected trade-offs. How do we navigate complex sites with CAPTCHA solvers and ensure everything remains intuitive for our analysts? Join me to uncover the challenges and innovations behind our data scraping journey, and learn how we turn chaos into actionable intelligence.

PyData Prague #22 - Summer Special Edition (28.8.2024 at Golemio)

(no talks, just discussion and refreshments)

PyData Prague #21 - Duck Hyping (17.6.2024 at Keboola)

František Kaláb - How Data Improves Your Life in Prague (video, slides)

At Prague Data Platform, we collect, process, and evaluate data from the city to enhance decision-making and improve the lives of its citizens. From waste management to public transport, this talk will provide an overview of the city’s data landscape and showcase specific areas where data is leveraged to the city’s advantage. I will also discuss how we utilize open-source technologies and adhere to open standards to make our work transparent and accessible to clients and citizens.

David Ešner - DuckDB Intro with Case Study from Keboola

DuckDB is a powerful tool for data analysts and data engineers. It is an embedded, in-process database that provides connectors and capabilities comparable to complex query engines like Spark or Presto, yet it’s as simple to set up as SQLite. In this talk, I will provide a brief introduction to the technology and how we leverage it as a simple magic wand for easy, in-process data crunching within the Keboola container architecture. I will walk you through our rather unusual use cases.

Open Source Science @ PyData Prague #20 (16.5.2024 at FNSPE CTU)

A joint event of PyData Prague and Open-Source Science (OSSci).

Recap available at Open Source Science Medium.

Tim Bonnemann - Accelerating Science with Open Source – An Introduction to Open-Source Science (OSSci) (slides)

Open-Source Science (OSSci) is a NumFOCUS initiative – launched in July 2022 in partnership with IBM – that aims to accelerate scientific research by improving the ways open source software in science gets done (built, used, funded, sustained, recognized, etc.). OSSci connects scientists, OSS developers and other stakeholders to share best practices, identify common pain points, and explore solutions together. The five OSSci interest groups to date cover domain-specific topics (chemistry/materials, life sciences/healthcare, climate/sustainability) as well as cross-domain topics (reproducibility, map of science), with more to be rolled out in 2024. This talk will provide a brief overview of OSSci’s activities to date, our plans for 2024, and how you can get involved.

Evelina Gabašová - Open source and academia: Research software engineering perspective (slides)

In the British academic system, a new movement was established in 2012 called “research software engineering”. The goal has been to recognise and promote the vital role of software in research and establish academic career paths for people who develop it. As a Vice-President of the Society of Research Software Engineering (and a research software engineer herself), Evelina will talk about some of the lessons learned over the past decade and what have been the challenges in recognising software and open source contributions as fully-fledged academic outputs.

Derek Homeier - Astropy – a community effort to develop a common core package for Astronomy in Python (slides)

Since the early 2010s Python has been recognised as a powerful option offering an alternative to then-dominant proprietary platforms like IDL or Matlab, or compiled (and often also proprietary) languages like Fortran and C++, for scientific data processing in the astrophysics research community. The need to implement efficient array computation and visualisation led to significant contributions to evolving modules like Numarray/Numpy and Matplotlib from some institutions and many individuals in the community; yet individual needs also started to set off a proliferation of independent solutions. Astropy was created to foster an ecosystem of interoperable astronomy packages, sharing common coding standards and data APIs, to allow and actively encourage contributors throughout the astronomical community to invest their development work into a widely usable and professionally maintained package. A decade later this has made Python+Astropy now the dominant data-processing platform in astrophysical research. But with this growth the project has also evolved from a relatively informal team effort to a more formally organised and structured system.

Martin Fleischmann - Open by Default: Developing reproducible, computational research (slides)

While academic research heavily depends on open-source software, the relationship is often one-way. We believe that designing research in close relation to open-source development is beneficial for all parties and present one way of doing that, by turning a research project into a component of the open-source ecosystem.

Academic research often depends on open-source software. Still, researchers do not contribute back that often due to the lack of institutional incentives, time demands, or an imposter syndrome (“my code is too messy”). However, open-source software development doesn’t have to be detached from academic work. The first step is a decision to make the code open. Then the question is, how? From an academic standpoint, packing up the functionality into a new package instead of contributing to existing libraries could lead to additional publications that matter in career progress. However, from an open-source standpoint, such an approach widens the ecosystem’s fragmentation and threatens its sustainability. In this talk, we outline why we chose the path benefiting open source over the academic benefits, how we did it and our vision of academic work closely linked to open-source development. We illustrate this approach in our work on the Urban Grammar AI project, combining aspects of radical openness of the process, making research code available as it is written, and enhancing existing libraries when we need new functionality. It led to significant contributions to the GeoPandas and PySAL ecosystems, a release of one independent package with functionality that didn’t fit elsewhere, and further developments of a canonical Docker container for geographic data science.

PyData Prague #19 - Pandas in Heaven (23.4.2024 at ČSOB)

Jakub Kramata - Big Bank Data in Migration: From In-House CSV to Parquet in Amazon S3 (video)

What is it like to migrate data from on-prem systems to the cloud in one of the largest banks in the Czech Republic? How to check that nothing has changed along the way? Join us to discover our Python-based solution! Not only will we go through the code, but we’ll also focus on the pitfalls, challenges, and hardships we have encountered along the way. We will thoroughly explain how we verify the migrated data, which is exported as CSV from the source system and stored as Parquet in the AWS Data Lake. Expect a candid and emotional presentation with thoughtfully named commits.

Hadi Abdi Khojasteh - Pandas Roadmap and Beyond (video)

Discover the essence of Pandas Enhancement Proposals, aka PDEPs, and their impact on the landscape of data analysis with Pandas. Join us for an overview of proposed changes, aimed at addressing corner cases and inconsistencies that have emerged over years of development, presented by one of Pandas’ core contributors. From banning (up)casting in “setitem-like” operations to reevaluating the inplace option in methods, delve into the behavioural shifts shaping the Pandas roadmap. Gain insight into the evolving copy and view semantics of operations, offering a glimpse into the anticipated Pandas 3.0 experience through an overview of changes in the new version.

PyData Prague #18 - A Vector from Lab to Hub (29.2.2024 at Pure Storage)

Milan Ondrašovič - Unlocking Efficiency: The Power of Vectorization (video, slides)

In this talk, we will break down a crucial yet often overlooked skill for ML engineers and data scientists: code vectorization, the cornerstone of modern numerical libraries. The aim is to show when and how to apply this technique to significantly boost performance. We will provide practical insight on implementation, discuss pros and cons, and explore the impact on the codebase. Using primarily Python and NumPy, our code examples will demonstrate the portability of vectorized solutions across libraries and languages.

Jakub Hettler - Jupyter(Hub/Lab): Journey from On-prem to AWS (video)

Let’s have a look at how we, the Business Intelligence team at Alma Career Czechia, moved JupyterHub and JupyterLab from on-premise infrastructure to AWS. We will cover why we used Amazon SageMaker Studio for just 3 weeks and why, in the end, we are happy with Jupyter running on top of Coder (coder.com) in AWS. The talk takes an infrastructure point of view, with a deeper dive into the pros and cons of on-prem JupyterHub/Lab on HashiCorp Nomad, Amazon SageMaker Studio, and Coder, all considering the requirements of 20 working users in JupyterLab.

PyData Prague #17 - One Chirped over the Large Data Nest (24.1.2024 at Apify)

Stefan Kahl & Josef Haupt - AI-powered bioacoustic monitoring with BirdNET (video)

In this talk, we will explore the fundamental principles of bioacoustic monitoring, highlighting the challenges of traditional methods and the breakthroughs enabled by AI. We will introduce BirdNET and explain its development, its capabilities and the machine learning algorithms that enable its performance. In our presentation, we will provide a comprehensive analysis of case studies where BirdNET has been successfully deployed, demonstrating its effectiveness in different ecosystems and its pivotal role in bird species identification and population tracking. In addition, we will discuss the wider implications of AI for ecological research and conservation efforts. The integration of BirdNET into citizen science projects and its impact on public involvement in conservation activities will also be highlighted. The presentation will conclude with an outlook on possible future developments in the field of AI-assisted bioacoustic monitoring and the exciting opportunities this presents for our understanding and protection of the natural world.

Jiří Moravčík - The data behind the success of (not only) large language models (video)

In 2023, the spotlight shone on AI, marking a transformative shift from a specialized computer science domain to a widely discussed topic across society. ChatGPT and other large language models (LLMs) captured global attention. This talk explores the pivotal datasets driving the training of AI models, investigating their background, creation methods, and more. Additionally, we’ll delve into some controversies surrounding existing datasets.

🇨🇿 Velké jazykomodely české a slovenské (Czech and Slovak Large Language Models) (15.12.2023 at MFF UK)

Working with large language models is much easier than it was a few years ago, and today literally anyone can train their own generative AI model. But when moving from English to Czech and Slovak texts, many of us run into the same problems. Let’s talk about them and discuss solutions: topics range from the “smaller” large language models such as BERT, through multilingual embeddings and the ChatGPT API, to freely available models such as Llama2 and Mistral, and from QLoRA adapters to ethical questions and the mood in society.

💬 This PyData meetup will have a format that is unusual for us: a panel discussion. Given the topic, the official language will be 🇨🇿 Czechoslovak 🇸🇰.

🙌 The following panelists have accepted our invitation:

Jindřich Libovický (ÚFAL MFF UK)
Filip Uhlarik (Gerulata Technologies, STU v Bratislave)
Petr Šimeček (CEITEC MU, Mediaboard)
Jakub Náplava (Seznam.cz)

PyData Prague #15 - Augmented Storyteller (20.11.2023 at Ataccama)

Furkan M. Torun - Become a Data Storyteller with Streamlit! (video)

In a data-driven world, the ability to tell stories with data is a superpower. Join us on a journey to unlock this power with Streamlit, an easy-to-use tool for quickly building and sharing interactive web applications for data science and machine learning. In this talk, we will dive into the numerous advantages of Streamlit as your trusted companion in the art of data storytelling. We will explore its components and learn how they can effortlessly turn your data into interactive and intuitive visual experiences.

But that’s not all! We’ll also embark on a voyage through a few real-world use cases where Streamlit shines, from rapid data exploration to prototyping machine learning models, and even creating custom, user-friendly data dashboards.

By the end of this talk, you’ll be armed with the knowledge and tools to captivate your audience through the magic of data-driven storytelling. Don’t miss your chance to unleash your inner data storyteller with Streamlit!

Soheyla Mirshahi - Retrieval Augmented Generation (RAG) In Practice

General-purpose large language models (LLMs) such as GPT-3 and Llama 2 have limitations when it comes to specific tasks. Retrieval Augmented Generation (RAG) and Fine-Tuning are two powerful techniques for enhancing the reliability of LLMs in organizations.

During this talk, we’ll explore:

The Pros and Cons: Dive deep into the advantages and potential pitfalls of RAG and Fine-Tuning.
Decision Factors: Understand the context that might lead you to choose one method over the other, or even integrate both.
A Real-World Example: A step-by-step walkthrough of how we employed RAG to address issues like outdated information, domain knowledge gaps, hallucinations, and other common challenges of LLMs.
Challenges and Resolutions: We reveal the bottlenecks we encountered – be it data quality, retrieval efficiency, model intricacies, or evaluation standards – and the strategies we adopted to navigate them.
Optimization Tips: Practical advice and best practices to optimize your RAG deployment.

By the end of this talk, you will have a clear understanding of whether RAG is the best option for you and, if so, how to do it in practice, what challenges to expect, and how to address them.
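
To make the retrieval step of RAG concrete, here is a minimal sketch that is not taken from the talk. It assumes a toy bag-of-words similarity instead of learned embeddings, and the documents, the question and all function names are hypothetical. It only builds the augmented prompt; no LLM is called.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Prepend the retrieved context to the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base for illustration only.
documents = [
    "Cafeteria hours: open from 8:00 until 15:00 on weekdays.",
    "Vacation requests must be approved by your team lead.",
    "Remote access to our data warehouse requires a VPN client.",
]
question = "When is the cafeteria open?"
top = retrieve(question, documents, k=1)
prompt = build_prompt(question, documents, k=1)
print(prompt)
```

In a real deployment the term counts would usually be replaced by embeddings and a vector index, but the shape of the pipeline (score, rank, stuff into the prompt) stays the same.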

This talk is suitable for data analysts, data scientists and LLM enthusiasts who want to understand the reality of using LLMs in production. Warning! It is not all roses; it gets dirty fast.

PyData Prague #14 - Voice upon a Time (3.5.2023 at Heureka)

Karel Boháček - Sign language recognition: Enabling communication for the hearing-impaired through machine learning (video)

In this talk, our project on sign language recognition will be presented. The aim of the project is to create a prototype app that can facilitate communication between hearing-impaired individuals and those who do not know sign language. The challenges faced by hearing-impaired individuals in communicating with the general population will be discussed, and how technology can help bridge this gap will be explained.

Python code relying on the MediaPipe library will be introduced, and its use in extracting key features from sign language gestures, such as hand and finger positions and movements, will be explained. The machine learning techniques used to recognize and classify these gestures will be delved into. The data preprocessing steps, model selection, and training process will be covered, as well as the evaluation metrics used to measure the accuracy of the model. Finally, the prototype will be showcased, and its operation in real-time will be demonstrated. The limitations of our current approach and potential future developments in the field will also be discussed.

Franz Kiraly - A unified interface for machine learning with time series - an introduction

sktime is a widely used scikit-learn compatible library for learning with time series. It is easily extensible by anyone, and interoperable with the PyData/NumFOCUS stack.

This short talk gives an introduction to sktime, the time series learning tasks it addresses, and a summary of usage and interfaces. It further introduces common feature extraction, pipelines and composition building blocks, and explains reduction patterns in sktime – using an estimator or object for one learning task to solve another, e.g., a tabular regressor for forecasting.
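
The reduction pattern mentioned above can be sketched without sktime at all. The example below is a plain NumPy illustration under stated assumptions, not sktime's API: it turns a series into a table of sliding windows, fits an ordinary least-squares regressor on that table, and forecasts recursively. All function names are hypothetical.

```python
import numpy as np


def make_windows(series, window):
    """Turn a 1-D series into a tabular (X, y) problem of lagged values."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y


def fit_linear(X, y):
    """Ordinary least squares with an intercept column (the tabular regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def forecast(series, coef, window, steps):
    """Recursive forecasting: each prediction is fed back in as an input."""
    history = list(series[-window:])
    predictions = []
    for _ in range(steps):
        x = np.append(history[-window:], 1.0)
        next_value = float(x @ coef)
        predictions.append(next_value)
        history.append(next_value)
    return np.array(predictions)


series = np.arange(20, dtype=float) * 2.0 + 1.0  # a simple linear trend
X, y = make_windows(series, window=3)
coef = fit_linear(X, y)
predictions = forecast(series, coef, window=3, steps=3)
print(predictions)
```

Any tabular regressor could stand in for `fit_linear`; the windowing and the feedback loop are what make it a forecaster.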

This is a basic introduction and overview talk; no prior knowledge is required.

PyData Prague #13 - In Good We Rust (13.12.2022 at Deepnote)

Ondřej Vostál - Async Python modules in Rust (video)

I’ll demonstrate how to write a Python module in Rust. What’s more, the module will communicate both ways between Python and Rust using asyncio and tokio. After the talk, you’ll be able to write an app that asynchronously processes data in Rust without holding the GIL. The main advantage of this approach as opposed to microservices is less boilerplate and saving the IPC overhead.

Jan Soubusta - How we built a Python SDK for our (open) APIs (video)

During this session, we will present how we built and open-sourced a Python SDK for our analytics APIs covered by OpenAPI specification. We will share best practices applied to Python projects (data classes, services, tests, documentation, GitHub Actions, pip, …). In the end, we will conduct a live demo - GitHub APIs -> PostgreSQL -> dbt -> GoodData -> Deepnote (GoodData Python SDK).

PyData Prague #12 - Minutes to a Degree (18. 10. 2022 at LMC)

Ondřej Bojar - Machine Translation Usable in Practice, Let’s Move to Minuting

During the last eight years of deep neural networks rewriting the landscape of machine translation (MT), we got to the stage where MT is clearly usable and progress hard to measure for language pairs covered well with data. Let us now apply similar technologies to the task of automatic creation of minutes from project meetings (“minuting”). With a similar devotion, we should make automatic minutes usable in a couple of years. And perhaps machine understanding will be necessary at last; to my surprise, it was not needed for MT.

Martin Fleischmann - A Gentle Introduction to Spatial Data in the Pandas Ecosystem (video)

“Everything is related to everything else, but near things are more related than distant things.” That is the first law of geography. But how do we apply it to data science? How do we ensure that our analysis has a spatial dimension and that it can be mapped? How can we combine data based on their location? Are there any spatial patterns? These are the questions you will be able to answer after a gentle introduction to spatial data science in the pandas ecosystem. We will start with a brief explanation of the key concepts like geometries and projections to introduce GeoPandas, a package that brings geo to pandas. Then we’ll check what the ecosystem supporting GeoPandas looks like and what it offers, followed by a short excursion to the realm of spatial analytics using the packages from the PySAL (Python Spatial Analysis Library) project. We will be able to expand traditional exploratory data analysis with a spatial dimension using the esda package, interpolate data between different spatial units using the tobler package, or analyse the structure of cities using momepy. Finally, when the data begin to look too large to work with, we switch to distributed dask-geopandas and wrap up our journey with spatial operations powered by Dask somewhere in the cloud.

PyData Prague #11 - Hovering over Capulets (27. 6. 2022 at Ataccama)

Radek Ježek - Using drone imagery to detect vegetation around power lines (video)

When was the last time you encountered a power outage? You would be a bit concerned if you knew how much of today’s critical infrastructure relies on the uninterrupted supply of electricity. How can drones help? This talk explains how we used drone images to automatically detect one of the most common causes of power outages - when trees come dangerously close to the wires.

The presentation will cover the complete workflow:

Capturing images using automated drone missions.
Building a 3D model of the terrain.
Using computer vision and neural networks for detecting wires in the images.
Building a 3D representation of the wires with epipolar geometry.
Visualizing the vegetation encroachment.

Matěj Račinský - Julia in Python’s den (video)

Julia is a rather new programming language, designed for modern scientific computing. This talk will compare Julia to Python, discuss its current state and production-readiness, emphasise its strengths and weaknesses, describe the interoperability with Python, and speculate whether it is now a good time to rewrite all your Python projects in Julia - all that with some important takeaways from more than a year of running Julia in production at Avast.

PyData Prague #10 - Pandemic Lite (10. 5. 2022 at Creative Dock)

Jeremy Tuloup - JupyterLite: Jupyter ❤️ WebAssembly ❤️ Python (slides, video)

JupyterLite is a JupyterLab distribution that runs entirely in the web browser, backed by in-browser language kernels. This presentation will be a functional talk to present JupyterLite with concrete examples and live demos.

Roman Neruda, Petra Vidnerová - Tested on agents - how we designed an agent-based epidemiological model (video)

Epidemiological modeling helps us to understand the dynamics of disease spread and the effects of various protective measures. Agent-based models provide simulation tools for detailed modeling of individual human behavior. We will present a general network agent model that has been used to study epidemic scenarios in various environments, including a typical Czech county, or a school.
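
For readers new to the approach, the following is a heavily simplified sketch of a network agent model, not the authors' model. It assumes a random contact network and a basic susceptible/infected/recovered (SIR) state machine; every parameter value and name is made up for illustration.

```python
import random


def simulate_sir(n_agents=200, n_contacts=4, p_infect=0.1, p_recover=0.05,
                 n_days=60, seed=42):
    """Run a toy SIR epidemic on a random contact network of agents."""
    rng = random.Random(seed)

    # Random contact network: each agent draws a few random neighbours.
    neighbours = {a: set() for a in range(n_agents)}
    for a in range(n_agents):
        for b in rng.sample(range(n_agents), n_contacts):
            if a != b:
                neighbours[a].add(b)
                neighbours[b].add(a)

    # Everyone starts susceptible except a handful of initial cases.
    state = ["S"] * n_agents
    for a in rng.sample(range(n_agents), 5):
        state[a] = "I"

    history = []
    for _ in range(n_days):
        new_state = list(state)
        for a in range(n_agents):
            if state[a] != "I":
                continue
            # An infected agent may infect each susceptible contact...
            for b in neighbours[a]:
                if state[b] == "S" and rng.random() < p_infect:
                    new_state[b] = "I"
            # ...and may recover at the end of the day.
            if rng.random() < p_recover:
                new_state[a] = "R"
        state = new_state
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


history = simulate_sir()
print(history[-1])
```

Protective measures can be explored in such a model by removing edges from the network or lowering `p_infect` for selected agents.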

PyData Prague #9 - Reinforcement Yearning (13. 7. 2021 at Holešovická Šachta)

No talks.

PyData Prague #8 - Collaborative Dimensions (5. 10. 2020, virtual)

Jan Matas - Making data science notebook collaborative (video)

Jupyter notebook is now one of the most popular tools for data scientists, even though it is fairly difficult to work with in a team setting. In this talk, we will first explore how notebooks work under the hood. Then we will discuss how we can build collaboration features to enable real-time editing (like Google Docs), and finally we will address some security challenges inherent to having collaborative data science tools in the cloud.
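
One "under the hood" detail that is easy to show is that a notebook file is a single JSON document. The sketch below builds a minimal notebook in the version 4 format by hand using only the standard library; the cell contents are hypothetical, and this is an illustration rather than material from the talk.

```python
import json

# A minimal notebook in the nbformat 4 layout: top-level metadata plus a list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Shared analysis",
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not executed yet
            "outputs": [],
            "source": "print('hello')",
        },
    ],
}

serialized = json.dumps(notebook, indent=1)  # what would be written to an .ipynb file
restored = json.loads(serialized)
code_cells = [c for c in restored["cells"] if c["cell_type"] == "code"]
print(len(code_cells))
```

Because code, prose and outputs all live in one file, two people saving at the same time overwrite each other, which is one reason real-time collaboration needs more than a shared file.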

Ondřej Grover - Xarray, more than Pandas in multiple dimensions (video, slides)

The Xarray library has in recent years become one of the de facto standards for working with multi-dimensional datasets in Python. While calling it “a generalization of Pandas into multiple dimensions” gives a reasonable first impression, there is much more to it than that. For instance, the transparent and well-structured API offers a concise handle on the depths of NumPy and Dask broadcasting magic. The API and helper functions also enable the construction of convenient and versatile wrappers of e.g. SciPy routines which then become applicable in any domain where data can be represented by Xarray containers. The talk will showcase the basic Pandas-like usage as well as some of the aforementioned advanced features.
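
To show what kind of broadcasting bookkeeping is meant, here is a NumPy-only sketch (deliberately not using Xarray) where the axes have to be aligned by hand. The array shapes, the variable names and the latitude weights are assumptions made for this example; Xarray attaches names to such axes and broadcasts by name instead.

```python
import numpy as np

n_time, n_lat, n_lon = 4, 3, 5
rng = np.random.default_rng(1)

# Toy data cube with axes (time, lat, lon).
temperature = rng.normal(loc=15.0, scale=5.0, size=(n_time, n_lat, n_lon))

# One weight per latitude, shape (n_lat,).
lat_weights = np.cos(np.deg2rad(np.array([0.0, 30.0, 60.0])))

# Plain `temperature * lat_weights` would try to match the trailing lon axis
# (length 5) with the weights (length 3) and fail, so we insert axes manually.
weighted = temperature * lat_weights[None, :, None]

# Subtracting a (lat, lon) mean works because it lines up with the trailing axes.
time_mean = temperature.mean(axis=0)
anomaly = temperature - time_mean

print(weighted.shape, anomaly.shape)
```

The `[None, :, None]` indexing is exactly the sort of positional detail that named dimensions are meant to hide.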

PyData Prague #7 - Call for Community (25. 8. 2020 at Klub Avu)

No talks.

PyData Prague #6 - Extended Automation (3. 2. 2020 at Creative Dock)

Adam Hanka - Automation in InsurTech and Banking (video)

Automation and artificial intelligence can bring improvements in customer experience, shorten application time and simplify the underwriting process, reduce costs and replace some tedious and repetitive human labor such as damage inspection or claim processing. How does it work in the world of finance, especially insurance and banking, both examples of the most traditional industries, posing strong resistance towards changes?

Possible future development in the industry will be demonstrated on two recent data science and machine learning projects:

Recognition and evaluation of car damages.
Scoring process based on structured information from social media.

Jan Pipek - A practical guide to designing implants for pandas (slides)

The pandas library offers three approaches to user customization – class inheritance, series/data frame accessors, and extension arrays/dtypes. The talk briefly introduces all of them with real-world examples and code, while focusing on a complete example implementation of a physical-unit-aware data column.
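
Of the three approaches, the accessor is the shortest to demonstrate. The sketch below registers a hypothetical `length` accessor that treats a Series as values in metres; it is a toy stand-in for unit handling, not the implementation presented in the talk.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("length")
class LengthAccessor:
    """Hypothetical accessor that interprets the Series values as metres."""

    def __init__(self, series):
        self._series = series

    def to_km(self):
        """Return the values converted from metres to kilometres."""
        return self._series / 1000.0

    def to_cm(self):
        """Return the values converted from metres to centimetres."""
        return self._series * 100.0


distances_m = pd.Series([1500.0, 250.0, 42195.0], name="distance")
distances_km = distances_m.length.to_km()
print(distances_km.tolist())
```

An accessor only adds a namespace of methods; the unit is not stored in the data, which is the gap that extension arrays and dtypes are designed to fill.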

PyData Prague #5 - Dashing Automobile (27. 11. 2019 at Microsoft)

Andrej Svitek - Data Science in Automotive industry (video)

The Automotive industry is one of the most important drivers (pun intended) of the Czech economy. With more than 1.4 million cars produced per year, 800 companies involved and 160,000 direct employees, Automotive is considered the largest industry in the Czech Republic, accounting for more than 9% of GDP.

With a constant flood of buzzwords (IIoT, Industry 4.0, Digital factory 2.0, Autonomous Driving just to name a few), the aim of this talk is to present a more realistic and data-driven perspective of the industry as well as look at the advantages and challenges of Data Science projects within this environment.

Dom Weldon - Dash: Interactive Data Visualization Web Apps with no Javascript (video, slides)

Your data science project may need to produce interactive tools and visualizations to allow end-users to explore data and results. Dash, a project by the team that makes Plotly, solves some of these problems by allowing data scientists to build rich and interactive websites in pure Python, with minimal knowledge of HTML and absolutely no JavaScript.

This talk will give an overview of Dash, how it works and what it can be used for, before outlining some of the common problems that emerge when data scientists are let loose to produce web applications, and web developers have to work with the pydata ecosystem.

PyData Prague #4 - Mapping Science (30. 9. 2019 at (A)void Floating Gallery)

Diar Masri - Data Science at LMC (video, slides)

Searching for a job is a struggle that LMC is trying to simplify. I would therefore like to give a practical overview of how data science is applied at LMC to achieve this goal. We apply natural language processing (NLP) techniques to better understand our users’ experience and abilities. Moreover, we examine the current preferences of our users with a recommender system in order to recommend the most relevant job options and shorten the path to their next career.

Vojtěch Filipec - Curious about a new place? Explore many of them via OpenStreetMaps API (video, slides and code)

Knowledge of neighborhoods may reveal deficient services to a marketing company; it also enables you to answer your spouse’s inquiry about points of interest on a day trip. To acquire such knowledge, you need a rich source of data with a decent API. OpenStreetMap offers both. This talk explains how to access the data points in OSM and how to query their labels and content using two Python libraries (overpass and overpy) as well as the underlying OSM query language.
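
As a small taste of the query language, the sketch below only assembles an Overpass QL query string for nodes with a given tag around a point; it sends no request and does not use the libraries named above. The helper name, the tag and the coordinates are illustrative assumptions.

```python
def build_overpass_query(key, value, lat, lon, radius_m):
    """Build an Overpass QL query for nodes tagged key=value near a point."""
    return (
        "[out:json];"
        f'node["{key}"="{value}"](around:{radius_m},{lat},{lon});'
        "out;"
    )


# Illustrative values: cafes within 500 m of a point in central Prague.
query = build_overpass_query("amenity", "cafe", 50.0875, 14.4213, 500)
print(query)
```

The resulting string can then be handed to whichever client you prefer for talking to an Overpass API endpoint.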

PyData Prague #3 - Mouse Experience (24. 6. 2019 at Microsoft)

Karla Fejfarová - My Data Look Like a Mouse (video, slides)

This talk will give you a short tour of the Czech Centre for Phenogenomics in the BIOCEV centre in Vestec and a brief overview of current research in mouse-based functional genomics. Karla will also present various types of data we generate during the research and will show you how Python helps us to overcome some of the everyday challenges we face. Karla Fejfarová is a biostatistician at the Czech Centre for Phenogenomics.

Tomáš Muchka - What the Heck Does UX Mean? (video)

They work in all kinds of companies from startups to corporates. They could be sitting right next to you: the UX designers. But what is their actual job and how could you help each other? Let’s find out with Tomáš Muchka, a senior UX designer at GoodData.

PyData Prague #2 - Optimized Elasticity (23. 1. 2019 at FIT CTU)

Honza Král - The how and why of Elasticsearch (video)

Elasticsearch is an open source distributed datastore. We will see what makes it tick and how to use it from Python, and try to draw some conclusions as to where it might fit in your organization. Honza Král is a principal consulting architect at Elastic and a long-term core committer to Django.

Jakub Urban - Optimizing numerical calculations in Python (video)

Jakub Urban, a senior Pythonista from Quantlane with rich experience in scientific computing and modelling, will show various possibilities for making your (mostly) numerical calculations in Python fast. He will cover optimization and parallelization using NumPy, Numba, Cython or Dask. You will learn that Python can be as fast as Fortran with very little effort. In case it cannot, you will see how to seamlessly turn Fortran/C/C++ into a Python module.

PyData Prague #1 - Open Source and Open Communities (18. 10. 2018 at Avast)

Ian Ozsvald - NumFOCUS and PyData + Delivering Data Science Projects (video)

Ian Ozsvald is a London-based data scientist with 15+ years of experience and a co-founder of PyData London, one of the largest PyData communities in the world. Ian will introduce PyData, NumFOCUS and the community behind these initiatives. He’ll then talk about successfully delivering a data science project.

Štěpán Roučka - Symbolic Computing with SymPy (video)

Štěpán Roučka is a contributor to the SymPy project, a computer algebra system widely used within academia and industry, both as a standalone tool and as part of other scientific packages. Štěpán will give a brief overview of SymPy’s functionality and will show you a few examples, which may motivate you to add SymPy to your data science toolbox.
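
For a flavour of what symbolic computing looks like, here is a minimal sketch with a few basic SymPy calls. The expressions are chosen for this example and are not taken from the talk.

```python
import sympy as sp

x = sp.symbols("x")

# Differentiate a product symbolically.
expr = sp.sin(x) * sp.exp(x)
derivative = sp.diff(expr, x)

# Integrate a polynomial and solve a quadratic equation exactly.
antiderivative = sp.integrate(3 * x**2, x)
roots = sp.solve(x**2 - 5 * x + 6, x)

print(derivative)
print(antiderivative)
print(roots)
```

The results are exact symbolic objects rather than floating-point numbers, so they can be simplified, substituted into, or turned into numerical functions later.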