Check out the 7 different topics, each representing a single block of talks and sessions, to guide you through the program of DevTalks Cluj.

Emerging tech
DevOps
Web & Mobile
Datafication
Java
Product Management
Smart Everything
Would you like to find out more about "Metrics in Distributed Systems"? Idan Tovi, Head of SRE at PayU, shares more about it: "Collecting metrics in a distributed system can be a real challenge, especially if the system wasn’t designed for it in advance. The number of different repositories and technologies, the way the metrics are collected, and the sheer number of metrics, dashboards and alerts to create are just some of the challenges. Still, we managed to overcome those challenges in less than 3 months! In this article I will explain how.

How it All Started

The system we had in place served our company’s customers well, but had started to become overloaded. So we decided to build a new system, starting with a minimum set of principles:
  • API first
  • SMACK (Spark, Mesos, Akka, Cassandra and Kafka)
  • Full automation
  • Horizontal scale
  • Small teams
The advantage of this approach was that we now had a team of 20 talented engineers exploring the microservices landscape at a high pace. The downside, however, was that we didn’t have any standardization around observability. So we knew we needed monitoring tools. We already had logs, but we quickly understood that we needed another level of observability beyond them. So we decided to focus on metrics.

The Initial Approach

The technology stack we decided to use was Sysdig, a very nice commercial metrics solution, together with an open source package written by me, and oh what a bad package it was… :) The package collected too many metrics and also caused a nice memory leak. The fact that everyone on the engineering team wanted custom metrics for their own service led to a huge number of metrics. This put massive load on the Sysdig system, which became unresponsive and eventually useless. Besides, the system was still in its early stages with no real customers, so who needs metrics anyway, right? Logs were enough for now, so we accepted the shortcomings of our Sysdig setup.
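To see why unbounded custom metrics can overwhelm a metrics backend, it helps to estimate the number of time series involved: roughly, metrics times the product of each label's cardinality. The sketch below is illustrative only; the helper and the numbers are hypothetical, not PayU's real figures.

```python
# Illustrative sketch: why per-service custom metrics with high-cardinality
# labels overload a metrics backend. A backend must track one time series
# per unique combination of metric name and label values.

from math import prod

def series_count(n_metrics: int, label_cardinalities: list[int]) -> int:
    """Approximate number of distinct time series produced."""
    return n_metrics * prod(label_cardinalities)

# A disciplined service: 10 metrics, labelled by endpoint (20) and status (5).
disciplined = series_count(10, [20, 5])        # 1_000 series

# The same service after adding a per-customer label (10_000 customers),
# because everyone wanted custom metrics for their own use case:
exploded = series_count(10, [20, 5, 10_000])   # 10_000_000 series

print(disciplined, exploded)
```

A 10,000x blow-up like this, multiplied across every service, is the kind of load that made the backend unresponsive.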

Taking it a Step Further

A year later the system started to take shape. Traffic started to shift from the legacy system and we kept adding more and more microservices. Just to give you a sense of what our stack included back then: DC/OS, Linkerd, Kafka, Cassandra, Druid, Spark, Elasticsearch, Consul, Vault, 4 different programming languages, and everything dockerised and based on AWS. At this stage, our Infra team felt there had to be a better way to monitor this growing stack, and decided to give metrics a second chance. This time we went with InfluxDB. We started by collecting the infrastructure metrics and then asked the developers to add some service metrics. However, Influx didn’t take it well: it doesn’t handle large numbers of time series well, and it struggled as soon as we started adding the service metrics. Still, we weren’t yet using the full potential of the system and we had a limited number of services, so after a couple of improvements to the application logs we didn’t feel the lack of metrics, and Influx gave us a mid-term solution for the infrastructure. We knew this could not scale, but we had more urgent things to handle. And indeed, as another year passed, we started to onboard larger customers. The load on the system was growing fast and logs were no longer enough.

Giving Metrics Another Try

As always, first we chose the technologies. We knew Prometheus had become part of the CNCF (Cloud Native Computing Foundation), and because our system is a cloud-native one we thought we should give it a try. As part of this choice we had to re-write our open source package, so we decided to write it from scratch and learn from our mistakes. We decided to expose far fewer metrics by default, and only valuable ones. Then we asked ourselves: what else should we do differently in order to succeed this time? And then we came up with the most important piece of the puzzle - we set up the SRE team, which took responsibility for the challenge. So for the first time, we had a team that was accountable for the observability of the system. But how should we handle more than 200 different services and a growing tech stack? The approach we took was to divide the services into groups. So we started to think about how to group our services, by summarizing the 3 most common practices of what you should monitor:
  • RED (Rate, Errors, Duration) - which is more focused on the application
  • USE (Utilization, Saturation, Errors) - which is a better fit for infrastructure
  • The 4 Golden Signals (Latency, Traffic, Errors, and Saturation)
We figured out that our services can be categorized by their functionality, so we defined our groups accordingly. Each group contains the different metrics we should monitor. In addition, each service can be part of more than one group; for example, a service with an API that also produces events to Kafka and is written in Scala will be part of three groups. Of course, this approach is flexible and we can always add more groups as we grow and as our tech stack evolves. The groups we chose to start with are:
  • General metrics like memory and CPU usage
  • Programming-language-specific metrics, like the Node.js event loop, JVM metrics and Golang goroutines
  • Message broker Consumers
  • Message broker Producers
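As a rough illustration of what the RED practice means for the API-flavoured groups above, here is a minimal, hypothetical collector tracking Rate, Errors and Duration per endpoint. A real service would use a Prometheus client library instead; this stripped-down version just shows which numbers a group's default dashboard needs.

```python
# Minimal RED (Rate, Errors, Duration) sketch for an API service.
# Names and thresholds are illustrative, not the article's actual code.

from collections import defaultdict

class RedMetrics:
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count per endpoint
        self.errors = defaultdict(int)      # Errors: failed requests per endpoint
        self.durations = defaultdict(list)  # Duration: latency samples per endpoint

    def observe(self, endpoint: str, status: int, seconds: float):
        """Record one handled request."""
        self.requests[endpoint] += 1
        if status >= 500:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(seconds)

    def error_ratio(self, endpoint: str) -> float:
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

red = RedMetrics()
red.observe("/payments", 200, 0.031)
red.observe("/payments", 502, 0.120)
print(red.error_ratio("/payments"))  # 0.5
```

Because every service in the "API" group exposes the same three signals, one shared dashboard and alert template covers them all.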

Taking an ‘Everything as Code’ Approach

We love to think of everything as code. We automate everything; even our documentation site is automated (not just the deployment - even some of the content is generated automatically). We decided to take the same approach with our dashboards and alerts. This is crucial because it helps developers embed this step in their pipeline as part of provisioning. It also makes things deterministic, since we will always create the same alerts/dashboards for every service within a group. In addition, it makes our dashboards and alerts reproducible in case of a major incident affecting our monitoring system. And last but not least, it makes collaboration easy: one of our biggest lessons in how not to block your engineering team from scaling up, either in number of people or number of technologies, is to make it easy to collaborate - so every engineer is more than welcome to add a new group or improve an existing one.
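A minimal sketch of what "alerts as code" can look like: one rule template per group, rendered identically for every service, so the resulting alerts are deterministic and reproducible. The metric name, threshold and labels below are illustrative assumptions, not the actual PayU rules; the output follows Prometheus's alerting-rule format.

```python
# Render the same Prometheus alerting rule for every service in the
# "API" group. Template contents are hypothetical examples.

API_GROUP_RULES = """\
groups:
- name: {service}-api
  rules:
  - alert: {service}HighErrorRate
    expr: rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]) / rate(http_requests_total{{service="{service}"}}[5m]) > 0.05
    for: 10m
    labels:
      team: {team}
"""

def render_rules(service: str, team: str) -> str:
    """Produce the group's default alert rules for one service."""
    return API_GROUP_RULES.format(service=service, team=team)

print(render_rules("checkout", "payments"))
```

Running this in each service's pipeline guarantees that two services in the same group can never drift apart in how they are alerted on.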

The Netflix Paved Road Culture

So far, we had the technology, a team, a good approach for grouping our services, a clear understanding of what we should monitor for each group, and an easy way to start monitoring and contributing. But how should we bring all this together for our engineers? Netflix's Paved Road culture was the answer. In a nutshell, it describes the relationship between Netflix teams (consumers and producers): the idea is to give engineers freedom, but also help them focus on their main business logic. We thus built three metrics packages that make it very easy to collect metrics from our applications, and we also created default panels for each of the groups mentioned earlier. Those panels can be used via a simple CLI tool, which is also packed in a Docker container, making it very easy to add a step to every service pipeline that creates dashboards in Grafana and alerts in the Alertmanager. It can actually take less than 10 minutes to add and provision metrics, dashboards and alerts for a service. In addition, from now on every new technology we use that fits into an existing group has minimum requirements regarding the metrics it should expose, and if a technology has no suitable group, everyone knows how to add one to our metrics ecosystem. So far so good for application metrics, but what should we do with infrastructure, which is more unique and harder to divide into groups? The answer is the same as for our services - EaC is the guiding principle. We added an automated way to upload full dashboards as part of the infrastructure provisioning/configuration pipeline (using the same CLI tool as for the apps). The main difference in this case is that we upload a full dashboard.
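The provisioning step such a CLI tool performs can be sketched roughly as follows. The panel layout and service names are hypothetical; the endpoint shape (POST to `/api/dashboards/db` with an `overwrite` flag) follows Grafana's HTTP dashboard API.

```python
# Sketch: build a Grafana dashboard payload from a service's groups and
# upload it as a pipeline step. Panels and names are illustrative.

import json
from urllib import request

def build_dashboard(service: str, groups: list[str]) -> dict:
    """One default panel per group the service belongs to."""
    panels = [{"title": f"{service} / {g}", "type": "graph"} for g in groups]
    return {
        "dashboard": {"uid": f"svc-{service}", "title": service, "panels": panels},
        "overwrite": True,  # pipeline re-runs must be idempotent
    }

def upload(grafana_url: str, token: str, payload: dict) -> None:
    """POST the dashboard to Grafana's HTTP API (raises on non-2xx)."""
    req = request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )
    request.urlopen(req)

payload = build_dashboard("checkout", ["general", "node.js", "kafka-producer"])
print(len(payload["dashboard"]["panels"]))  # 3
```

Because the payload is derived purely from the service's group membership, rebuilding every dashboard after a monitoring-system incident is a matter of re-running the pipelines.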
Using those two main approaches and the right technologies, we managed to resolve the lack of observability of a pretty large technology stack and a lot of different services in less than 3 months - but obviously we couldn't have done it without learning the lessons from our previous failures.

In Summary

So to recap, these are the main takeaways I think everyone should take from our experience:
  1. As with everything in a distributed system, it is always harder than with a monolith.
  2. If you can, design your metrics from the beginning. Metrics are important for observability and production debugging, but also for knowing that you meet your SLOs.
  3. Either way, whether you are closing the gaps like us or designing from the beginning, EaC and the Paved Road are amazing principles and, in my opinion, the only way to scale both your system and your engineering.
  4. Choose the right technologies for your type of system. Prometheus is amazing for cloud-native distributed systems, but there are a lot of great solutions out there that might fit your system better.
  5. Make someone responsible and accountable for your system's production observability; otherwise it will be pushed back in priority until it becomes critical, and critical can be too late.
  6. Use the community - there is a huge amount of knowledge already built up in the community; SRE is not something specific to Google anymore.
  7. And never give up - as we believe: “fail fast, learn fast”."
Training air traffic control (ATC) experts is a domain within the public safety industry where adopting the latest technologies proves to be the right choice for better risk reduction. Adopting mobile, cloud, speech recognition and artificial intelligence technologies in the training of air traffic controllers makes it much more accessible in terms of technical equipment and, of course, more affordable. The public safety industry, despite its huge dependency on technology, is one of the last adopters of the latest technological advancements, given its need to use the most reliable and mature technology available. This principle can easily be seen in the continued use of analogue communications despite 5G being deployed in commercial and civil environments. Current air traffic simulators, used as the foundation of ATC expert training, are complex systems that replicate voice communications, RADAR consoles, meteorological information systems, runway light controls, surface movement radar consoles and, of course, a 3D replica of the view out of the control tower. All these requirements have a huge impact on the cost of ATC training infrastructure, making it a premium facility that not every ANSP (air navigation service provider) can afford. As a result, many ATCs have to be trained in simulators abroad and experience this as a premium, rather than as an accessible infrastructure they could use at any point in their career. ATC simulators need to replicate 100% of an operational system's functionality, but they do not need to use 100% the same technology as an operational system. We have developed a platform that maps 100% of the operational functionality but relies on a set of the latest technologies as a foundation, bringing new horizons to ATC training options.
The first layer of the new technological foundation is an entirely new scenario engine that includes complex management and rendering capabilities, has multiple interoperability endpoints and allows diverse scenario management. The second layer, built on top of the foundation, is the presentation layer, which has been abstracted and made mobile-first. This allows the simulator to be deployed on less complex hardware such as commercially available tablets. All components of the simulator now run on tablets, making the ATC simulator accessible in general mobile app stores and on affordable devices. The third layer comprises three of the latest technologies: voice recognition to support input, Augmented Reality to deliver immersive experiences, and Artificial Intelligence engines used to learn ATC behaviour during crisis scenarios. Introducing new technologies into the rigid public safety industry generates an impressive set of advantages with regard to ATC training options. The first advantage comes from the new business model enabled by the availability of the training and simulation platform on mobile devices. Heavy and complex infrastructures are no longer required to test your ATC abilities and run complex emergency scenarios; all of it can now be done from your tablet, using an app downloaded from your usual app store. This creates a new business model, ATC training as a service, and availability increases dramatically, adding to the pool of ANSP customers (two to three hundred) the large pool of ATCs (300-400 K worldwide) who can procure the platform directly. Adding a new perception dimension to the training process is just as important: using augmented reality for the traditional 2D space of RADAR consoles.
Now, given the AR capabilities of recent tablets on the market, we are able to deliver Augmented Reality for the evaluation process of a particular ATC exercise. Another innovation added to the platform is its shared cloud infrastructure. While in the traditional model each training centre could use only its own training scenarios, now, using cloud technologies, multiple training centres can openly share their scenarios and their approaches to solving them. Collaboration gets a new definition among ATC specialists, and the latest ATC techniques are now published faster to a global audience. This work is part of the project Innovative informatic instrument for the training and testing of Air Traffic Controllers, developed by SC Sim Soft Distribution SRL. The project is co-financed through the European Regional Development Fund – Priority Axis 2 – “Information and communication technology for a more competitive digital economy” of the Operational Programme “Innovations and Competitiveness” 2014-2020. The contents of this material do not necessarily represent the official position of the European Union or of the Romanian Government.
The Future of Instant Messaging for Organizations – A Design Thinking Approach

With GDPR applicable since May 2018, many companies started taking decisive measures against “shadow” usage of messaging apps. Continental banned the usage of WhatsApp and Snapchat on company mobiles. Deutsche Bank went even further, banning instant messaging and all messaging apps altogether. However, the need for instant messaging to exchange critical business data in real time remains, especially considering the fast-paced business environment and customers' expectation to access business and government services as quickly as possible. In discussions with several clients, many of them dealing with sensitive and private data (e.g. medical service providers), they pointed out that, due to the dispersal of relevant data and information, instant messaging remains essential for gathering and propagating vital data in real time. At the same time, managers want to be in control of who accesses the data they exchange. This means using the best encryption available (end-to-end), as well as hosting the messaging solution in their private cloud. To fully cover the security concerns, an MDM (Mobile Device Management) component has to be part of the solution. GDPR compliance is a must. For increased usability, integration with enterprise apps is also required. With these requirements in mind and with a strong desire to provide clients with an app that ticks all the boxes, the Romanian company Trencadis is developing the FORTYPE messaging app. In recognition of the fact that stepping out of the “shadow IT” zone means not only banning existing apps but also providing companies and governments with a proper solution, the EU awarded a grant for building FORTYPE, a product developed within the project TALOS - Secure Mobile Intra-organizational Communication, implemented by Trencadis Corp.
We have brought together the flexibility of the Design Thinking approach, centered on end users, with an MDM solution for administrators, and came up with four core features that had not been considered before alongside the “bread and butter” functionalities of typical instant messaging apps.
  1. In-app document scanning. Focusing on specific use cases from the medical domain, we learned that physical documents sometimes need to be sent via instant messaging for rapid sharing. This is possible using device-specific functionality (taking pictures, then sending them), but it is a rather lengthy and error-prone process, especially for multi-page documents. The in-app functionality enables multi-page scanning, followed immediately by sharing with the relevant users.
  2. Large file sharing. Most instant messaging apps impose reasonable limitations on message attachments (e.g. a 100 MB limit in WhatsApp). However, medical files can be rather large - CT and MRI scans can reach gigabyte sizes. To overcome this limitation, FORTYPE will provide a built-in file-sharing option, automatically activated when the attachment exceeds a certain size. By simply attaching a large document, it is placed in the file-sharing location and the message contains a link that allows the relevant users to download it when needed.
  3. Temporary enrollment of external users. While FORTYPE is designed as an internal instant messaging platform, we had to recognize that in certain business scenarios, such as quickly pulling together recent medical information for a patient, relevant data sits with external users. They need to be enrolled quickly so they can securely share the information they hold, be it medical records or imaging files. With this in mind, we created an external user enrollment process initiated via an SMS sent to the external user's phone number.
  4. Configurable user rights.
This functionality came as a logical consequence of enrolling external users. For example, when external users are part of an ad-hoc group focused on pulling together all relevant medical information for a patient, they should be able to see only the attachments they provided, not the attachments shared by other users. This is required in order to avoid unwanted disclosure of confidential medical information. Obviously, the internal users in charge of gathering all relevant medical documents have the right to access all attachments provided by internal and external users. We believe that Design Thinking, centered on users' needs and on defining detailed interaction scenarios, is extremely valuable for shaping new products in fields that seem crowded or saturated. If we take a closer look at a business, we come to the realization that the lines between products/services and user environments are blurring. If companies can deliver an integrated customer experience, it will open up opportunities to build new businesses. This project is co-financed with funding from the European Regional Development Fund - Priority Axis 2 - “Information and Communication Technology (ICT) for a competitive digital economy”, through the Operational Competitiveness Programme 2014-2020. The content of this material does not necessarily represent an official position of the European Union or the Government of Romania.
Fishpointer, the first social interactive map for fishing, is a 100% Romanian product addressing an international audience estimated at over 150 million fishing enthusiasts connected to the Internet. The Fishpointer application combines interactive electronic maps with socialization tools that allow users to record their own catch data, generate personal statistics, and socialize around fishing experiences. The app is available for free on Google Play, the App Store and as a WebApp. https://play.google.com/store/apps/details?id=com.fishpointer https://itunes.apple.com/ro/app/fishpointer/id977021201?l=ro&mt=8
Maponia, a user-generated-content app dedicated to professional drivers, aims to change the pattern of merchandise transportation in Europe. It offers drivers power, autonomy and better transportation through efficient planning, safer routes and a more pleasant driving experience, using the power of the community. Available for free on Google Play, the App Store and as a WebApp. www.youtube.com/embed/RyIt2CRRJlo https://itunes.apple.com/ro/app/maponia/id1342623172?mt=8 https://play.google.com/store/apps/details?id=com.maponia
09:30 - 09:45
Welcome Opening
10:00 - 10:15
Intro Speech – Bosch

Speakers Panel

Kaschek F. Konrad

Speaker

Adriana Dumitrof

Moderator

10:15 - 11:00
Antonio Ferreira -
Blockchain 2.0: from Automotive 4.0 » to Mobility 3.0 » to Smart City 2.0

Speaker

11:00 - 11:45
Radu Hârceagă -
Industry 4.0, Technology behind technology

Speaker

11:45 - 12:25
Septimiu Nechifor -
Internet of Things – IOT an interplay between IT and OT. A quest for right questions

Speaker

12:25 - 13:00
Daniel Costea -
Bringing .NET Bot alive with .NET Core, Machine Learning and IoT!

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
Cristian Dragu -
Collective intelligence – the bond between technology and smart communities

Speaker

14:50 - 15:35
Ciprian Alexandru Caragea -
Trusted data and smart statistics valuing the digital era

Speaker

15:40 - 16:20
Karina Popova -
IoT & Agriculture: precision farming challenges

Speaker

16:25 - 17:05
Alex Bordei -
DIY Automation – playing with IoT and wearables

Speaker

09:30 - 09:45
BIG OPENING - Main Stage
09:30 - 09:45
Welcome Opening
09:30 - 09:45
Welcome Opening
10:00 - 10:15
Intro Speech – Automation is here

Speakers Panel

Andrei Roth

Speaker

Monica Alexandru

Moderator

10:00 - 10:15
Intro Speech – Metro Systems

Speakers Panel

Ivan George

Speaker

Tudor Adam

Moderator

10:00 - 10:15
Mike Elsmore -
Intro Speech

Speaker

10:15 - 11:00
Adam Bien -
Jakarta EE + MicroProfile = Productivity, Sustainability, Fun #slideless

Speaker

10:15 - 11:00
Em Grasmeder -
CONTINUOUS INTELLIGENCE: LEVERAGING DATA SCIENCE WITH CONTINUOUS DELIVERY

Speaker

11:00 - 11:45
Ovidiu Petridean -
Data Management with Microservices

Speaker

11:00 - 11:45
Kafka & us – from buzzword to bff

Speakers Panel

Madalina Lazar

Speaker

Alexandra Ureche

Speaker

10:15 - 11:00
Jock Busuttil -
Be More Human

Speaker

11:00 - 11:45
Ovidiu Ponoran -
THE MINDSET OF SUCCESSFUL VALUE CREATORS

Speaker

11:45 - 12:25
Valentina Crisan -
The nitty-gritty of Kafka Streams

Speaker

11:45 - 12:25
Felice Pescatore -
Don’t Dirty my Backlog: Healthy Product Backlog

Speaker

11:45 - 12:25
Tools Java Developers Should Learn in 2030

Speakers Panel

Ties van de Ven

Speaker

Adam Bien

Speaker

Cristian Baita

Speaker

12:25 - 13:00
Grosu Andrei-Nicolae -
Reactive Web with Spring Boot

Speaker

12:25 - 13:00
Paul Chirila -
We are the product

Speaker

13:00 - 14:00
Networking Lunch
13:00 - 14:00
Networking Lunch
13:00 - 14:00
Networking Lunch
14:05 - 14:45
James Murphy -
Mindset over skillset

Speaker

14:05 - 14:45
Felix Crisan -
(Big) Data in (Block) Chains

Speaker

14:05 - 14:45
Cristian Baita -
Growing programming skills – tips and tricks

Speaker

14:50 - 15:35
Radu Vunvulea -
Demystifying messaging communication patterns

Speaker

14:50 - 15:35
Bogdan Solga -
Reactive Programming with Spring

Speaker

14:50 - 15:35
Catalina Banuleasa -
How do we bridge people and technology through design

Speaker

15:40 - 16:20
Marton Kodok -
BigQuery ML – Machine learning at scale using SQL

Speaker

15:40 - 16:20
Ties van de Ven -
CONCEPTS THAT IMPROVE THE WAY YOU THINK ABOUT CODE

Speaker

15:40 - 16:20
Gary Crawford -
There’s nothing we can do

Speaker

16:25 - 17:05
How to Build Products People Care About

Speakers Panel

Oliver Gibson

Speaker

Monira Rhaimi

Speaker

16:25 - 17:05
Cristian Lungu -
On being a Machine Learning Engineer

Speaker

09:30 - 09:45
Welcome Opening
10:00 - 10:15
Intro speech – DevOps: What’s next?

Speakers Panel

Mihai Seulean

Speaker

Alex Lakatos

Moderator

10:15 - 11:00
Alex Casalboni -
Configuration management and service discovery in a serverless world

Speaker

11:00 - 11:45
Sergiu Cisar -
The road of transformation: From On-prem to Cloud

Speaker

11:45 - 12:25
Eugene Istrati -
TERRAFORM FOR SERVERLESS. BEST PRACTICES. LESSONS LEARNED.

Speaker

12:25 - 13:00
Victor Ionescu -
Finding your sweet spot on the cloud computing continuum

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
The Future of DevOps for the Enterprise – Trends & Insights

Speakers Panel

Felice Pescatore

Speaker

Eugene Istrati

Speaker

Alex Lakatos

Speaker

14:50 - 15:35
Björn Rabenstein -
Prometheus – what’s new and what’s next?

Speaker

15:40 - 16:20
Felice Pescatore -
VALUE FOCUS TEAM: road to DevOps

Speaker

09:30 - 09:45
Welcome Opening
10:00 - 10:15
Intro Speech – Softvision

Speakers Panel

Alin Turcu

Speaker

Stefania Ioana Chiorean

Moderator

10:15 - 11:00
Willian Martins da Silva -
Back to the future of JS II: Beyond what we can foresee

Speaker

11:00 - 11:45
Andrei Cristea -
Offensive OSINT: Get the data

Speaker

11:45 - 12:25
Radu Marin -
Userspace Invaders

Speaker

12:25 - 13:00
Paul Ardelean -
Your Smartphone Is Also A Phone!

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
Alex Bordei -
Making Alexa able to provide real-time DevTalks Agenda related answers

Speaker

14:50 - 15:35
Mike Elsmore -
All the small things

Speaker

09:30 - 09:45
Welcome opening
10:00 - 10:15
Intro Speech – MHP Romania

Speakers Panel

Dr. Oliver Oswald

Speaker

Paul Ardelean

Moderator

10:15 - 11:00
Florin Otto -
Blockchain: a road to the future

Speaker

11:00 - 11:30
Dragos Filipovici -
PWAs on the shoulders of Serverless giants

Speaker

11:30 - 12:00
Constantin Müller -
Algorithmic production in smart factory with closed loop processes

Speaker

12:00 - 12:45
Silviu-Tudor Serban -
The true potential of AI for well-being, safety and privacy

Speaker

12:45 - 13:15
Tudor Lapusan -
Visual interpretation of Decision Tree models

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
Simina Serban -
Cloud data integrations with Azure Data Factory and Logic Apps

Speaker

14:50 - 15:35
Julia Biro -
Introduction to flow-based programming with Node-RED

Speaker

15:40 - 16:20
Håkan Silfvernagel -
Affective Computing – What is it and why should I care?

Speaker

16:25 - 17:05
Viral Parmar -
WebVR: Beyond Imagination

Speaker

15:40 - 16:20
Trishul Goel -
PWAs on steroids

Speaker

16:25 - 17:05
Vladimir Grinenko -
Experience we’ve got building design system for 300+ developers

Speaker

17:05 - 17:45
Paul Apostol -
Required Tech vs. Emerging Tech

Speaker

16:25 - 17:05
Radu Vunvulea -
Tools and competences on DevOps

Speaker

Partners