Check out the different topics, each representing a single block of talks and sessions to guide you through the program of
DevTalks Bucharest.

Day 1
Day 2
Emerging tech
DevOps
Web
Smart Everything
Java
Big Data&Cloud
Product Management
Metrics in Distributed Systems

Would you like to find out more about "Metrics in Distributed Systems"? Idan Tovi, Head of SRE at PayU, shares more about it! "Collecting metrics in a distributed system can be a real challenge, especially if the system wasn’t designed for it in advance. The number of different repositories and technologies, the way to collect the metrics, and the number of different metrics, dashboards and alerts to create are just some of the challenges. Still, we managed to overcome those challenges in less than 3 months! In this article I will explain how.

How it All Started

The system we had in place served our company’s customers well, but had started to become overloaded. So we decided to build a new system, starting with a minimum set of principles:
  • API first
  • SMACK (Spark, Mesos, Akka, Cassandra and Kafka)
  • Full automation
  • Horizontal scale
  • Small teams
The advantage of this approach was that we now had a team of 20 talented engineers exploring the microservices landscape at a high pace. The downside, however, was that we didn’t have any standardization around observability. So we knew we needed monitoring tools. We already had logs, but we quickly understood that we needed a level of observability beyond that. So we decided to focus on metrics.

The Initial Approach

The technology stack we decided to use was Sysdig, a very nice commercial metrics solution, together with an open source package written by me - and oh, what a bad package it was… :) The package collected too many metrics and also caused a nice memory leak. The fact that everyone on the engineering team wanted custom metrics for their own service led to a huge number of metrics. This put massive load on the Sysdig system, which became unresponsive and eventually not useful. Besides, the system was still in its early stages with no real customers, so who needs metrics anyway, right?!? Logs were enough for now, so we accepted the downside of using Sysdig.

Taking it a Step Further

A year later the system started to take shape. Traffic started to shift from the legacy system and we kept adding more and more microservices. Just to give you a sense of what our stack included back then: DC/OS, Linkerd, Kafka, Cassandra, Druid, Spark, Elasticsearch, Consul, Vault, four different programming languages - and everything was dockerised and based on AWS. At this stage, our Infra team felt they needed a better way to monitor this growing stack, and they decided to give metrics a second chance. This time we went with InfluxDB. We started by collecting the infrastructure metrics and then asked the developers to add some service metrics. However, Influx didn’t take it well: it doesn’t handle large numbers of time series well, and it struggled as soon as the service metrics arrived. Still, we weren’t yet using the full potential of the system and we had a limited number of services, so after a couple of improvements to the application logs we didn’t feel the lack of metrics, and Influx gave us a mid-term solution for the infrastructure. We knew this could not scale, but we had more urgent things to handle. And indeed, as another year passed, we started to onboard larger customers. The load on the system was growing fast and logs were no longer enough.
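To see why the number of time series explodes so quickly, here is a back-of-the-envelope sketch; the figures are purely hypothetical, not PayU's actual numbers:

```python
# Hypothetical figures for illustration only - every extra label multiplies
# the number of distinct time series the database has to index.
services = 200
instances_per_service = 3
metrics_per_service = 50
label_combinations = 20   # e.g. endpoint x HTTP status code

total_series = (services * instances_per_service
                * metrics_per_service * label_combinations)
print(f"{total_series:,} distinct time series")  # 600,000
```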

Giving Metrics Another Try

As always, we first chose the technologies. We knew Prometheus had become part of the CNCF (Cloud Native Computing Foundation), and because our system is a cloud-native one we thought we should give it a try. As part of this choice we had to rewrite our open source package, so we decided to write it from scratch and learn from our mistakes. This time we exposed far fewer metrics by default, and only valuable ones. Then we thought about what else we should do differently in order to succeed this time. And we came up with the most important piece of the puzzle - we set up the SRE team, which took responsibility for the challenge. So for the first time, we had a team that was accountable for the observability of the system. But how should we handle more than 200 different services and a growing tech stack? The approach we took was to divide the services into groups. So we started to think about how to group our services by summarizing the three most common practices of what you should monitor:
  • RED (Rate, Errors, Duration) - which is more focused on the application
  • USE (Utilization, Saturation, Errors) - which is a better fit for infrastructure
  • The 4 Golden Signals (Latency, Traffic, Errors, and Saturation)
We figured out that our services can be categorized by their functionality, so we defined our groups accordingly. Each group contains the different metrics we should monitor. In addition, each service can be part of more than one group; for example, a service with an API that also produces events to Kafka and is written in Scala will be part of three groups. Of course, this approach is flexible and we can always add more groups as we grow and as our tech stack evolves. The groups we chose to start with are listed below (a sketch of the RED-style metrics for an API service follows the list):
  • General metrics like memory and CPU usage
  • Programming-language-specific metrics like the Node.js event loop, JVM metrics and Golang goroutines
  • Message broker consumers
  • Message broker producers
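To make the grouping concrete, here is a minimal sketch of what the RED metrics for a service in the API group could look like, using the Python prometheus_client library; the metric names, labels and handler are illustrative assumptions, not the actual package:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors: one counter, with the status label distinguishing failures.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "path", "status"])
# Duration: a histogram of request latency in seconds.
DURATION = Histogram("http_request_duration_seconds",
                     "HTTP request latency in seconds", ["method", "path"])

def handle(method: str, path: str) -> int:
    """A stand-in request handler instrumented with the RED metrics."""
    start = time.perf_counter()
    status = 500  # assume failure unless the handler finishes cleanly
    try:
        # ... real business logic would run here ...
        status = 200
        return status
    finally:
        REQUESTS.labels(method, path, str(status)).inc()
        DURATION.labels(method, path).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle("GET", "/health")
```

A service that also produces to Kafka would simply pull in the producer group's metrics in the same way.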
 

Taking an ‘Everything as Code’ Approach

We love to think of everything as code. We automate everything - even our documentation site is automated (not just the deployment; even some of the content is automatically generated). We decided to take the same approach with our dashboards and alerts. This is crucial because it helps developers embed this step in their pipeline as part of provisioning. It also makes things deterministic, since we will always create the same alerts and dashboards for every service within a group. In addition, it makes our dashboards and alerts reproducible in case of a major incident in our monitoring system. And last but not least, it makes collaboration easy: one of our biggest lessons in how not to block your engineering team from scaling up, either in number of people or in number of technologies, is to make it easy to collaborate - so every engineer is more than welcome to add a new group or improve an existing one.
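As an illustration of what "alerts as code" can mean in practice, here is a minimal sketch that generates the same Prometheus alert rule for every service in a group; the service names, expression and thresholds are assumptions for this example, not the team's actual definitions:

```python
# A minimal "alerts as code" sketch: the same Prometheus alert rule is
# generated for every service in a group from a single template.
import yaml  # PyYAML

SERVICES = ["payments-api", "payouts-api"]  # hypothetical group members

def error_rate_rule(service: str) -> dict:
    """A 5xx-ratio alert, identical in shape for every service in the group."""
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m])) > 0.05'
    )
    return {
        "alert": f"{service}-high-error-rate",
        "expr": expr,
        "for": "10m",
        "labels": {"severity": "page"},
        "annotations": {"summary": f"{service} is serving more than 5% errors"},
    }

rules = {"groups": [{"name": "api-red-alerts",
                     "rules": [error_rate_rule(s) for s in SERVICES]}]}

# Written in the Prometheus rule-file format, ready to be provisioned.
with open("api-red-alerts.yml", "w") as f:
    yaml.safe_dump(rules, f, sort_keys=False)
```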

The Netflix Paved Road Culture

So far, we have the technology, a team, a good approach for grouping our services, a clear understanding of what we should monitor for each group, and an easy way to start monitoring and contributing. But how should we bring this all together for our engineers? Netflix's Paved Road culture was the answer to this question. In a nutshell, it describes the relationship between Netflix teams (consumers and producers). The idea is to give freedom to the engineers, but also to help them focus on their main business logic. We thus decided to build three metrics packages to collect metrics very easily from our applications, and we also created default panels for each group mentioned earlier. Those panels can be used via a simple CLI tool, which is also packaged in a Docker container and makes it very easy to add a step to every service pipeline that adds dashboards in Grafana and alerts in Alertmanager. It can actually take less than 10 minutes to add and provision metrics, dashboards and alerts for a service. In addition, from now on every new technology we use that fits into an existing group has minimum requirements regarding the metrics it should expose. If there’s a technology which has no suitable group, everyone knows how to add it to our metrics ecosystem. So far so good for application metrics, but what should we do with infrastructure, which is more unique and harder to divide into groups? The answer is the same as for our services - EaC is the guiding principle. We added an automated way to upload full dashboards as part of the infrastructure provisioning/configuration pipeline (using the same CLI tool as for the apps). The main difference in this case is that we upload a full dashboard. Using those two main approaches and the right technologies, we managed to resolve the lack of observability of a pretty large technology stack and a lot of different services in less than 3 months - but obviously we couldn’t have done it without learning the lessons from our previous failures.
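To give a feel for what such a pipeline step might do, here is a minimal sketch that provisions a dashboard through Grafana's HTTP API; the URL, token handling and dashboard body are placeholders, since the actual CLI tool is not public:

```python
# A minimal sketch of a pipeline step that provisions a dashboard through
# Grafana's HTTP API (POST /api/dashboards/db).
import json
import os
import urllib.request

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://grafana:3000")
API_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # injected by the pipeline

payload = {
    "dashboard": {
        "id": None,
        "uid": None,
        "title": "payments-api - RED overview",  # hypothetical service name
        "panels": [],                            # default group panels go here
    },
    "overwrite": True,  # keeps pipeline re-runs idempotent
}

req = urllib.request.Request(
    f"{GRAFANA_URL}/api/dashboards/db",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": f"Bearer {API_TOKEN}"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```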

In Summary

So to recap, these are the main takeaways I think everyone should take from our experience:
  1. As with everything in a distributed system, it is always harder than with a monolith.
  2. If you can, design your metrics from the beginning. Metrics are important for observability and production debugging, but also for knowing that you meet your SLOs.
  3. Either way, whether you are closing the gaps like us or designing from the beginning, EaC and the Paved Road are amazing principles and, in my opinion, the only way to scale both your system and your engineering organization.
  4. Choose the right technologies for your type of system. Prometheus is amazing for cloud-native distributed systems, but there are a lot of great solutions out there that might fit your system better.
  5. Make someone responsible and accountable for your system’s production observability; otherwise it will be pushed back in priority until it becomes critical - and critical can be too late.
  6. Use the community; there is a huge amount of knowledge already built up there - SRE is not something specific to Google anymore.
  7. And never give up - as we believe, “fail fast, learn fast”."
 
More Efficient Trainings for Air Traffic Control Professionals with the Help of New Technologies

Training air traffic control (ATC) experts belongs to the public safety industry, where adopting the latest technologies proves to be the right choice for better risk reduction. Adopting mobile, cloud, speech recognition and artificial intelligence technologies in the process of training air traffic controllers makes that training much more accessible in terms of technical equipment and, of course, more affordable. The public safety industry, despite its huge dependency on technology, is one of the last adopters of the latest technological advancements, given its need to use the most reliable and mature technology available. This principle can easily be seen in the continued use of analogue communications despite 5G being deployed in commercial and civil environments.
Current air traffic simulators, used as the foundation of ATC expert training, are complex systems that replicate voice communications, RADAR consoles, meteorological information systems, runway lights controls, surface movement radar consoles and, of course, a 3D replica of the view out of the control tower. All these requirements have a huge impact on the cost of ATC training infrastructures, making them premium facilities that not all ANSPs (air navigation service providers) can afford. As a result, many ATCs have to train in simulators abroad and experience this as a premium, rather than an accessible infrastructure they could use at any point in their career. ATC simulators need to replicate 100% of an operational system’s functionality, but they do not need to use 100% the same technology as an operational system. We have developed a platform that maps 100% of operational functionality but relies on a set of the latest technologies as a foundation, bringing new horizons to ATC training options.
The first layer of the new technological foundation is an entirely new scenario engine that includes complex management and rendering capabilities, has multiple interoperability endpoints and allows diverse scenario management. The second layer, built on top of the foundation, is the presentation layer, which has been abstracted and made mobile-first. This allows the deployment of the simulator on less complex hardware such as commercially available tablets. All components of the simulator now run on tablets, making the ATC simulator accessible in general mobile app stores and on affordable devices. The third layer comprises three of the latest technologies: voice recognition to support the input options, Augmented Reality to deliver immersive experiences, and Artificial Intelligence engines used to learn ATC behaviour during crisis scenarios.
Introducing new technologies into the rigid public safety industry generates an impressive set of advantages with regard to ATC training options. The first advantage comes from the new business model enabled by the availability of the training and simulation platform on mobile devices. Heavy and complex infrastructures are no longer required to test your ATC abilities and to run complex emergency scenarios; all of it can now be done from your tablet using an app downloaded from your traditional app store. This creates a new business model, ATC training as a service, and availability increases dramatically, adding to the pool of ANSP customers (two to three hundred) the large pool of ATCs (300-400 K worldwide) who can procure the platform directly.
Adding a new perception dimension to the training process is just as important: using augmented reality for the traditional 2D space of RADAR consoles. Given the AR capabilities of recent tablets on the market, we are able to deliver Augmented Reality in the evaluation process of a particular ATC exercise. Another innovation added to the platform is its shared cloud infrastructure. While in the traditional model each training centre could use only its own training scenarios, now, using cloud technologies, multiple training centres can openly share their scenarios and approaches to solving them, so collaboration gets a new definition among ATC specialists, and the latest ATC techniques reach a global audience faster. This work is part of the project Innovative informatic instrument for the training and testing of Air Traffic Controllers, developed by SC Sim Soft Distribution SRL. The project is co-financed through the European Regional Development Fund – Priority Axis 2 – “Information and communication technology for a more competitive digital economy” – by the Operational Programme “Innovations and Competitiveness” 2014-2020. The contents of this material do not necessarily represent the official position of the European Union or of the Romanian Government.
The Future of Instant Messaging for Organizations – A Design Thinking Approach

With GDPR applicable since May 2018, many companies started taking decisive measures against “shadow” usage of messaging apps. Continental banned the usage of WhatsApp and Snapchat on company mobiles. Deutsche Bank went even further, banning instant messaging along with all messaging apps. However, the need for instant messaging to exchange critical business data in real time remains, especially considering the fast-paced business environment and customer expectations of gaining access to business and government services as quickly as possible. In discussions with several clients, many of them dealing with sensitive and private data (e.g. medical services providers), they point out that, due to the dispersal of relevant data and information, instant messaging remains essential for gathering and propagating vital data in real time. At the same time, managers want to be in control of who accesses the data they exchange. This means using the best encryption available (end-to-end), as well as hosting the messaging solution in their private cloud. To fully cover the security concerns, an MDM (Mobile Device Management) component has to be part of the solution. GDPR compliance is a must. For increased usability, integration with enterprise apps is also required. With these requirements in mind, and with a strong desire to provide clients with an app that ticks all the boxes, the Romanian company Trencadis is developing the FORTYPE messaging app. In recognition of the fact that stepping out of the “shadow IT” zone means not only banning existing apps but also providing companies and governments with a proper solution, the EU awarded a grant for building FORTYPE, a product developed within the project TALOS - Secure Mobile Intra-organizational Communication, implemented by Trencadis Corp. We have brought together the flexibility of the Design Thinking approach, centered on end users, with an MDM solution for administrators, and came up with four core features that were not considered before among the “bread and butter” functionalities of typical instant messaging apps.
1. In-app document scanning. Focusing on specific use cases from the medical domain, we learned that physical documents sometimes need to be sent via instant messaging for rapid sharing. This is possible using device-specific functionalities (like taking pictures, then sending them), but it is a rather lengthy and error-prone process, especially for multi-page documents. The in-app functionality enables multi-page scanning, followed immediately by sharing with the relevant users.
2. Large file sharing. Most instant messaging apps impose reasonable limits on attachment size (e.g. 100 MB in WhatsApp). However, medical files can be rather large - CT and MRI scans can reach gigabyte sizes. To overcome this limitation, FORTYPE will provide a built-in file-sharing option, automatically activated when the attachment is above a certain size. By simply attaching a large document, it will be placed in the file-sharing location and the message will contain a link that allows the relevant users to download it when needed (see the sketch after this article).
3. Temporary enrollment of external users. While FORTYPE is designed as an internal instant messaging platform, we had to recognize that in certain business scenarios, like the need to quickly pull together recent medical information for a patient, relevant data sits with external users. They need to be enrolled quickly in order to allow them to securely share the information they hold, be it medical records or imaging files. With this in mind, we created an external user enrollment process initiated via an SMS sent to the external user’s phone number.
4. Configurable user rights. This functionality came as a logical consequence of enrolling external users. For example, when external users are part of an ad-hoc group focused on pulling together all relevant medical information for a patient, the external users should be able to see only the attachments they provided and not the attachments shared by other users. This is required in order to avoid unwanted disclosure of confidential medical information. Obviously, the internal users in charge of gathering all relevant medical documents have the right to access all attachments provided by internal and external users.
We believe that Design Thinking, centered on users’ needs and on defining detailed interaction scenarios, is extremely valuable for shaping new products within fields that seem crowded or saturated. If we take a closer look at a business, we come to the realization that the lines between products/services and user environments are blurring. If companies can deliver an integrated customer experience, it will open up opportunities to build new businesses. This project is co-financed with funding from the European Regional Development Fund - Priority Axis 2 - “Information and Communication Technology (ICT) for a competitive digital economy”, through the Operational Competitiveness Programme 2014-2020. The content of this material does not necessarily represent an official position of the European Union or the Government of Romania.
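As a rough illustration of the large-attachment rule described in feature 2, here is a minimal sketch; the threshold, storage location and link format are assumptions for the example, since FORTYPE's implementation is not public:

```python
# A rough sketch of the large-attachment rule: attachments above a threshold
# go to a file-sharing location and the message carries a download link.
import os
import uuid
from typing import Optional

LARGE_FILE_THRESHOLD = 100 * 1024 * 1024   # hypothetical 100 MB cut-off
SHARE_DIR = "/srv/fortype/shared"          # hypothetical file-sharing location

def build_message(text: str, attachment: Optional[str] = None) -> dict:
    message = {"text": text}
    if attachment is None:
        return message
    if os.path.getsize(attachment) <= LARGE_FILE_THRESHOLD:
        message["attachment"] = attachment           # small file: send inline
    else:
        token = uuid.uuid4().hex                     # opaque download token
        os.rename(attachment, os.path.join(SHARE_DIR, token))
        message["link"] = f"https://files.example.org/{token}"  # placeholder URL
    return message
```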
Fishpointer

Fishpointer, the first social-interactive map for fishing, is a 100% Romanian product that addresses an international audience estimated at over 150 million fishing enthusiasts connected to the Internet. The Fishpointer application combines interactive electronic maps with socialization tools, allowing users to record their own catch data, generate personal statistics, and share their fishing experiences. The app is available for free on Google Play, the App Store and as a WebApp. https://play.google.com/store/apps/details?id=com.fishpointer https://itunes.apple.com/ro/app/fishpointer/id977021201?l=ro&mt=8
Maponia

Maponia, a user-generated-content app dedicated to professional drivers, aims to change the pattern of merchandise transportation in Europe. It offers drivers power, autonomy and better transportation through efficient planning, safer routes and a more pleasant driving experience, using the power of community. Available for free on Google Play, the App Store and as a WebApp. www.youtube.com/embed/RyIt2CRRJlo https://itunes.apple.com/ro/app/maponia/id1342623172?mt=8 https://play.google.com/store/apps/details?id=com.maponia
16:50 - 17:20
Florin Micu - Robotics in Automotive Validation

Speaker

17:20 - 18:00
Mihai Raneti - Automotive & The World of Tomorrow

Speaker

09:45 - 10:00
Intro Speech – Softvision

Speakers Panel

Iulian Hostiuc

Speaker

Stefania Ioana Chiorean

Moderator

09:45 - 10:00
How the paradigms have changed

Speakers Panel

Radu Negulescu

Speaker

Victor Radu Gradinescu

Moderator

09:45 - 10:00
Intro Speech – RINF TECH

Speakers Panel

Ioana Balasa

Speaker

Richard Lewington

Moderator

09:45 - 10:00
Intro Speech – Product lifecycle management

Speakers Panel

Bogdan Cazangiu

Speaker

Cristian Orasanu

Moderator

09:45 - 10:00
Intro Speech – LSEG

Speakers Panel

Ann Neidenbach

Speaker

Vitaly Friedman

Moderator

09:45 - 10:00
Intro Speech – Metro Systems

Speakers Panel

Irina Poiana

Speaker

Grosu Andrei-Nicolae

Moderator

09:45 - 10:00
Alex Bordei - Intro Speech

Moderator

10:00 - 10:45
Radu Vunvulea - TOOLS AND COMPETENCES ON DEVOPS

Speaker

10:00 - 10:45
Alexandru Bolboaca - Thinking in functions in C++

Speaker

10:00 - 10:45
Daniel Appelquist - Why We Need a more Ethical Web

Speaker

10:00 - 10:45
Daniel Zacarias - Priority starts at the top

Speaker

10:00 - 10:45
Matt Jarvis - Dealing with Kubesprawl – Tetris style!

Speaker

10:00 - 10:45
Audrey Tang - Digital Social Innovation

Speaker

10:00 - 10:45
Victor Rentea - The Proxy Fairy and the Magic of Spring

Speaker

10:50 - 11:35
Adrian Neatu - Kubernetes in the Cloud(s)

Speaker

10:50 - 11:35
Mihai Tudor - Trading systems in the age of speed and the return of the Field Programmable Gate Arrays (FPGAs)

Speaker

10:50 - 11:35
Heather Wilde - Choice is Overrated – Designing Products That Know What You Want Before You Do

Speaker

10:50 - 11:35
Josep Panadero - Managing sport betting risk at a scale with the in-memory data grid Hazelcast

Speaker

10:50 - 11:35
Daniel Radu - Processing files with Kafka and Cassandra at cloud scale

Speaker

10:50 - 11:35
Enhanced Reality – A Practical Perspective & MATT – from sketch to a fully functional testing robot

Speakers Panel

Mihai Craciunescu

Speaker

Dragos Barbulescu

Speaker

11:40 - 12:20
Lucian Oprea - Reactive Microservices In Practice

Speaker

11:40 - 12:20
Laurent Gendrier - Traditional BI or Big Data?

Speaker

11:40 - 12:20
Rafal Czuprynski - Digital transformation, AI – what the hell does it mean for me?

Speaker

11:40 - 12:20
Idan Tovi - Metrics in Distributed Systems – From Zero to Hero

Speaker

11:40 - 12:20
Richard Gratton - Stop Asking How Long; Ask How Big

Speaker

11:40 - 12:20
Matthias Schulze - DriveCore, Visteon’s platform for ADAS and Autonomous Driving

Speaker

12:25 - 13:00
Marcin Szymaniuk - DataFrames in Spark – the analyst’s perspective

Speaker

12:25 - 13:00
Matthew Renze - Artificial intelligence: The future of software

Speaker

12:25 - 13:00
Patrick Balulescu - Smart Car > 5G & beyond

Speaker

12:25 - 13:00
Steve Poole - An Open Future for Java in the Cloud

Speaker

12:25 - 13:00
Erwin Staal - DevOps – The Automation of Compliance

Speaker

12:25 - 13:00
Vladimir Grinenko - Balanced development in large teams

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
API Development in a Microservices World & Automating SD-WAN Deployment in a Virtualized Cloud Environment

Speakers Panel

Bogdan Meca

Speaker

Andrei Chițu

Speaker

14:05 - 14:45
Adam West - Demystifying the hype: AI and its future in business

Speaker

14:05 - 14:45
Victor Ionescu - Finding your sweet spot on the cloud computing continuum

Speaker

14:05 - 14:45
Antonio Ferreira - Blockchain 2.0: from Automotive 4.0 » to Mobility 3.0 » to Smart City 2.0

Speaker

14:05 - 14:45
Andrei Mihaescu - Microservices in 30 minutes or less

Speaker

14:05 - 14:45
Sergio Pereira - A Software Developer’s path to Blockchain

Speaker

14:50 - 15:35
Tom Dunlap - Aligning a Data Policy Framework with Technology Enablers: Best Practices and Pitfalls

Speaker

14:50 - 15:35
More than coding: insights on emerging technology from major players

Speakers Panel

Gabriel Diaconescu

Speaker

Alin Butnarasu

Speaker

14:50 - 15:35
Lucian Moroeanu - Operational Excellence, from zero to prod and beyond

Speaker

14:50 - 15:35
Marcus Biel - Java, Turbocharged

Speaker

14:50 - 15:35
Sébastien Stormacq - Continuous Integration & Continuous Deployment for your containers and serverless applications

Speaker

14:50 - 15:35
Laurentiu Matei - Keep your product and your team sane

Speaker

15:40 - 16:20
Cross-GPU Machine Learning for Video Games

Speakers Panel

Ciprian Păduraru

Speaker

Alexandru-Marian Atanasiu

Speaker

15:40 - 16:20
Alexandra Petrus - AI for Executives and Product Leaders

Speaker

15:40 - 16:20
Stephan Hochdörfer - From dev to prod with GitLab CI

Speaker

15:40 - 16:20
Georgi Kodinov - MySQL Enterprise Data Masking

Speaker

15:40 - 16:20
Karina Popova - IoT & Agriculture: Precision Farming Challenges

Speaker

10:50 - 11:35
Gerrit Grunwald - MULTI DEVICE CONTROLS – A DIFFERENT APPROACH TO UX

Speaker

11:40 - 12:20
Petrache Marin - Building VueJS applications with Typescript

Speaker

12:25 - 13:00
Stefania Ioana Chiorean - Keynote

Speaker

13:00 - 14:00
Networking Lunch

14:05 - 14:45
Alexandra Petrea - 3 ways to reusable ReactJS applications

Speaker

14:50 - 15:35
Roadblocks on becoming an expert web developer

Speakers Panel

Tamás Tompa

Speaker

Attila Molnár

Speaker

Mátyás Fórián Szabó

Speaker

15:40 - 16:20
Holger Bartel - Make Cache Control Work for You

Speaker

16:25 - 17:05
Vladimir Grinenko - C – consistency or Code reuse in really large projects

Speaker

16:25 - 17:05
Cédric van Beijsterveldt - Machine Learning: not as new as it seems

Speaker

16:25 - 17:05
Exploring New Technologies In Modern Architectures

Speakers Panel

Răzvan Simion

Speaker

Dan Ștefănescu

Speaker

16:25 - 17:05
Sabin Popa - Surviving in a hybrid cloud world

Speaker

16:25 - 17:05
Catalina Banuleasa - How do we bridge people and technology through design

Speaker

15:40 - 16:20
Ivar Grimstad - Microservice Patterns – Implemented by Eclipse MicroProfile

Speaker

16:25 - 17:05
Eliska Slobodova - Making teams work effectively

Speaker

16:25 - 17:05
Ties van de Ven - Concepts that improve the way you think about code

Speaker

17:10 - 17:45
Viral Parmar - WEBVR: BEYOND IMAGINATION

Speaker

17:10 - 17:40
James Murphy - Mindset over skillset

Speaker

Emerging tech
Digital Transformation
Mobile
QA&Testing
SAP
Security
09:45 - 10:00
Intro Speech – EY

Speakers Panel

Veronica Stefan

Moderator

Cristian Carstoiu

Speaker

09:45 - 10:00
Intro Speech – LSEG

Speakers Panel

Andreea Stanescu

Speaker

Colin Whitfield

Moderator

10:00 - 10:45
Håkan Silfvernagel - Affective Computing – What is it and why should I care?

Speaker

10:50 - 11:35
Eduard Dumitrașcu - The Future of Cities | Shaping the Modern City

Speaker

11:40 - 12:20
Blockchain: a road to the future

Speakers Panel

Florin Otto

Speaker

Dan Popescu

Speaker

12:25 - 13:00
Henk Boelman - A.I. on the Microsoft Stack

Speaker

13:00 - 14:00
Networking Lunch

14:05 - 14:45
Sébastien Stormacq - Add intelligence to applications with AWS Machine Learning Services

Speaker

14:50 - 15:35
Silviu-Tudor Serban - The true potential of AI for well-being, safety and privacy

Speaker

15:40 - 16:20
Roy van Rijn - From TicTacToe to AlphaGo

Speaker

09:45 - 10:00
Intro Speech

Speakers Panel

Monica Alexandru

Moderator

Dev Speaker

Speaker

09:45 - 10:00
The world without testing

Speakers Panel

Florin Manolescu

Moderator

Stefan Radulescu

Speaker

09:45 - 10:00
Intro Speech – Deloitte Digital

Speakers Panel

Monica Zara

Moderator

Cornel Coser

Speaker

10:00 - 10:45
Tom Raftery - Our digital future – we can optimize for people, planet, and profit :)

Speaker

10:00 - 10:45
Georgi Kodinov - What’s new in MySQL 8.0 security?

Speaker

10:00 - 10:45
Nell Watson - Magnanimous Machines

Speaker

10:00 - 10:45
Kristel Kruustük - Why should I Care About Testing?

Speaker

10:50 - 11:35
Daniel Trandafir - Embracing change in enterprise application development

Speaker

10:50 - 11:35
Live Hacking Session – Have you ever wondered how an attacker can gain control of your IT system?

Speakers Panel

Mădălin Dumitru

Speaker

Răzvan Cernăianu

Speaker

10:50 - 11:35
Cristian Carstoiu - How to convince any CEO to invest in your digital idea

Speaker

10:50 - 11:35
Shai Liberman - Testing for AI Vs AI for testing

Speaker

11:40 - 12:20
Andrei Stoica - MVC Pattern in Test Automation

Speaker

11:40 - 12:20
Iulian Paduraru - Level up your Agile Transformation

Speaker

11:40 - 12:20
David Heinzer - Strategic tools for digital transformation at Edenred

Speaker

11:40 - 12:20
Erik Hajnal - Security is fun!

Speaker

12:25 - 13:00
Stephan Gerling - Remote yacht hacking – status quo

Speaker

12:25 - 13:00
Bogdan Lucaciu - The technology behind bras

Speaker

12:25 - 13:00
Octavian Rosca - From ABAP to BW and BPC

Speaker

12:25 - 13:00
Karina Popova - IoT & Medicine: Improving the Science of Medicine

Speaker

13:00 - 14:00
Networking Lunch
14:05 - 14:45
Emma Grasmeder - Continuous Intelligence: Leveraging Data Science with Continuous Delivery

Speaker

14:50 - 15:35
Radu Lacatus - From distributed ball of mud to clean architecture – where to start

Speaker

15:40 - 16:20
Gary Crawford - Models of transformation

Speaker

14:05 - 14:45
Building cloud-native extensions around the SAP core: 2 years of experience shared

Speakers Panel

Helmut Königseder

Speaker

Victor Ionescu

Speaker

14:05 - 14:45
Eduard Mirescu - AI-Empowered QA Communication

Speaker

14:05 - 14:45
Sorin Radulescu - The many faces of cloud security

Speaker

14:50 - 15:35
Gabriel Balaes - Performance testing for fashion

Speaker

14:50 - 15:35
When connected devices start a rIoT

Speakers Panel

Andrei Petrus

Speaker

Tatiana Petrache

Speaker

14:50 - 15:35
Maximilian Streifeneder - Who took my servers?

Speaker

15:40 - 16:20
Jan Kopriva - What happens when we fail to think about security

Speaker

16:25 - 17:05
Serban Bejan - DevSecOps Use Case: Squeeze Security Into Your Pipeline

Speaker

15:40 - 16:20
DevSpeaker - Keynote Powered by SAP

Speaker

15:40 - 16:20
George Stan - Agile Testing. Agile Mindset

Speaker

09:45 - 10:00
Intro Speech – Games of Crossplatforms

Speakers Panel

Ostap Andrusiv

Moderator

Olexii Ratych

Speaker

10:00 - 10:45
Ostap Andrusiv - Mobile App Marketing for Developers

Speaker

10:50 - 11:35
Stanislav Sorochich - The pain of PWA: launching heavy web games on mobile

Speaker

11:40 - 12:20
Alex Gherghina - How to augment your reality

Speaker

12:25 - 13:00
Ciprian Caba - Innovating the digital mobile advertising industry

Speaker

13:00 - 14:00
Networking Lunch

14:05 - 14:45
Vitaly Friedman - Smarter Mobile Interface Design Patterns

Speaker

14:50 - 15:35
Magda Miu - Workout your tasks with WorkManager

Speaker

15:40 - 16:20
Trishul Goel - PWAs on steroids

Speaker

16:25 - 17:00
Cristian Orasanu - Crypto, Blockchain, Lambos & other Buzzwords. Crypto people wear suits now.

Speaker

Partners