Listen: Why Taboola Moved from Oracle Java to Zing
Published September 27, 2020, 9:00 | Tamir Gefen
I recently watched an interesting webinar by Azul, a company we represent, together with the VP IT of Taboola.
The webinar will interest anyone running Java and JVM production environments.
The speakers were Simon Ritter, Deputy CTO at Azul, and Ariel Pisetzky of Taboola.
Ariel explained why they use Zing to improve Java/JVM performance (including before-and-after performance slides), recommended the product, and described why they moved away from Oracle's Java support.
A few interesting points that came up during the talk:
- Taboola handles a lot of big data:
  - 3.2 billion web pages a day
  - 1.4 billion (unique) users a month
  - 30 billion recommendations served daily
  - 8,500 servers around the world
  - 30 billion log lines every day
- Minute 15: CPU performance before and after Zing (image below); Zing improved performance significantly
- Minute 25: migrating to it is easy
- Ariel noted they have been working with Zing for 3 years
- Before Zing they used G1 (a garbage collector); that was 4 years ago
- At the end of the webinar he mentioned something they can only do with Zing
- They use Grafana, Prometheus, Cassandra, and Kafka, and pull data with local agents
For your convenience, we have added the full transcript.
To watch why they moved to Zing:
Transcription:
Good morning, good afternoon or good evening, depending on where you are, and welcome to this webinar. Today we're going to be talking about why Taboola switched from Oracle to Azul Java.
…
So my name is Simon Ritter, I work as the Deputy CTO of Azul Systems, and I'm joined today by Ariel Pisetzky, who is the VP IT at Taboola. I'll let Ariel introduce himself and start talking about Taboola.
[Ariel] Thank you very much, Simon. I'm Ariel, I'm with Taboola, and I would like to start by talking a bit about Taboola, and then talk a lot about what we do with Azul, specifically with Zing, and why it was helpful for us.
So just a bit about Taboola: Taboola is a content discovery platform. We are that company that helps you find content that you may like and never knew existed. When you browse or surf the web and you are at the bottom of an article, or maybe midsection, maybe on the right rail, and see boxes of content that you may like: that is us. We really help people find interesting and new things at the moment they are engaged, what we like to call those moments of next, when you are leaning in, when you are looking at your screen, you are consuming content, and you wish to find, after the article, the next interesting thing, the next thing that is relevant for you. And I've said "for you" twice now, and the main driver for that is the personalization: every recommendation is personalized. That means that if Simon and I were to browse the same page of the same publisher, we would not actually be seeing the same recommendations; there are recommendations for Simon and recommendations for me.
Now we do this in multiple digital places that you've seen online, to the extent that we now provide two to three to four billion web pages a day, depending on the day. Obviously, in the time of COVID, we saw a huge surge over the first month, during March and a bit into April, when every one of us was online looking for more news, trying to understand what is happening around us. From then that number has now returned to, I'd say, normal browsing, but still it is really interesting to watch the ebb and flow and how we provide that moment of next to every one of you.
Generally speaking about numbers: we see 1.4 billion unique users every month, and we actually serve about 1.5 billion clicks, which means that beyond those billions of recommendations every day, people also of course click on those recommendations, and we need to serve that recommendation and provide everything behind that click, which as an atomic kind of operation would seem very benign, but at scale is actually quite a big challenge in terms of compute. You would see that on average it's 3.2 billion web pages a day, 30 billion recommendations, around a million queries per second on our system as a fully aggregated load. So the amount of just requests coming in, that is an interesting number. And of course, for every recommendation we will have a line of log, for every server we will have metrics; there's a lot of data that we need to pull in, a lot of different events that have to be served for different people within the company.
[Simon] And that's big data, isn't it?
[Ariel] That is big data, in terms of variety, in terms of volume, and in terms of, I'd say, the velocity.
And if you are looking at what we did, this is the immediate first way to understand how Zing helped. So maybe before I talk about this graph, for a second I'll just say that we are a Java shop: our application is Java based, our backend is Java based, we have other applications that are Java based, and we have multiple companies that have merged into Taboola over the years. So we see a whole lot of impact every time we can optimize anything on the specific Java platform.
And here, this is like a plain vanilla, community-version Cassandra, with real data from Taboola, from our clusters. And this is like the first view where you can see that point in time where we had enough of Zing installed on our cluster, where we moved from a higher level of latency to a lower level of latency in all of the measurements, not only in the p50 but, as you can see here in the red line, at the higher percentiles as well. So it's really amazing to see how we improved the runtime of the application by actually just changing the JVM. So the Cassandra now, on really hundreds of servers, is not served with the plain old JVM but with the Azul Zing JVM, saving us on latency and allowing us to serve better.
[Ariel] And, a wonderful thing...
[Simon] Yeah, so as I... I won't stop you, so I'll jump in after.
[Ariel] Sorry, okay. A wonderful thing that you can see here is the flattening of the averages and even of the p95. I mean, the p99 is still very noisy, the red line is still very noisy, but the blue line, as you can see, is like totally flat. So 95 percent of the requests coming into this cluster have actually been totally flattened, and that is due to the job of the, I'm sorry, the garbage collection being handled properly via the Azul JVM.
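For readers who want a concrete sense of what the p50/p95/p99 lines on these graphs summarize, here is a small, hypothetical Java sketch (ours, not Taboola's) that computes percentiles over a batch of recorded request latencies using the nearest-rank method:

```java
import java.util.Arrays;

// Hypothetical helper for illustration: summarizes a set of request
// latencies the way the p50/p95/p99 lines on a latency graph do.
public class Percentiles {
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank method: smallest value with at least p% of samples at or below it.
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // A mostly fast workload with two GC-pause-like outliers.
        long[] samples = {12, 15, 11, 240, 13, 14, 16, 900, 12, 13};
        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p95 = " + percentile(samples, 95) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

Note how a couple of slow outliers, which is exactly the shape GC pauses produce, dominate the p95 and p99 while barely moving the p50; that is why the tail percentiles are the noisy lines on these graphs.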
[Ariel] Yes, Simon, you were... what were you going to say?
[Simon] I was just going to add a little bit of detail there, because clearly, as you saw, what's happening there is that the garbage collection was interfering with what was happening from the application and actually being able to deliver the results that you're looking for. What we do is essentially eliminate the garbage collection pauses by doing it concurrently with the application. And that's a really big difference, because we're running the garbage collection simultaneously with the application, so the queries... you're able to return results whilst we're doing the garbage collection in the background. And the way that we do that is by using a read barrier, so that every time you access an object, what we can do is ensure that you can do that safely. So if we're doing marking, we always make sure we mark the object before we give it to you; if we're actually moving objects around within the heap, which we actually do, again concurrently with the application, we do it totally safely, so you can make any changes to those objects. And then, when we give them to you to use, you can make any changes completely safely. So that's really the big thing: it's safe, and it gives you that elimination of the latency in the way that you're seeing in that particular graph.
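Simon is describing Zing's concurrent collector at a high level; the read barrier lives inside the JVM, so application code never sees it. Purely as an illustration of the idea, here is a toy Java sketch of our own (an analogy, nothing like the real implementation):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy analogy only: in Zing the barrier is applied by the JVM on every
// reference load. Here we just funnel reads through a check so the caller
// only ever receives an object the "collector" has already processed.
class Obj {
    final AtomicBoolean processed = new AtomicBoolean(false);
    int payload;
}

public class ToyReadBarrier {
    static Obj read(Obj ref) {
        if (ref != null && ref.processed.compareAndSet(false, true)) {
            // A real collector would mark the object live, or fix up the
            // reference if the object was relocated, before handing it over.
        }
        return ref; // always safe for the application to use
    }

    public static void main(String[] args) {
        Obj o = new Obj();
        o.payload = 42;
        System.out.println(read(o).payload); // every access goes through read()
    }
}
```

The point of the real mechanism is the same as in the toy: because every reference the application loads has already been checked, marking and compaction can proceed in the background without stopping the application threads.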
[Ariel] So I would even use another, maybe very strong, word... but for us, when I first heard of Azul, it was described as a drop-in replacement, and it was as easy as that. We just... we didn't have to think about this, we didn't have to do anything in our code, we didn't have to do anything else. And this is Cassandra; we obviously didn't touch Cassandra, we just swapped the JVM and it worked.
[Simon] Yeah, that is a nice point: you can just drop in the JVM, you don't have to change any of your code, there's no recompilation, no recoding to take advantage of these features. And even to the point where you don't have to change any of your startup scripts, you don't have to change any of the configuration. We make it really, really easy from the tuning point of view, because if you wanted to go in and actually do any tuning, essentially what you start with is just changing the size of the heap, and that's really all you need. Command line flags that you would typically use with other JVMs are not required for Zing, and so you can use the same startup scripts, because any of the flags we don't support we just ignore, so it doesn't matter.
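Since heap size is, per Simon, essentially the one knob you start with, one quick generic check (plain JDK, nothing Zing-specific) is to print the heap limits the running JVM actually picked up after you change -Xmx in a startup script:

```java
// Generic JVM sanity check, not Azul-specific: report the heap limits the
// running process actually applied after a startup-script change.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("max heap:   %,d MiB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap: %,d MiB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free heap:  %,d MiB%n", rt.freeMemory() / (1024 * 1024));
    }
}
```

Run it with, say, java -Xmx8g HeapCheck and compare the reported maximum against what the script intended.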
[Ariel] Yes. And I'll go into the next graph, which is the metrics as seen from a single server. It's the same kind of graph that you saw earlier, but this time, where the former graph was from the kind of application view (and that's why the top is smudged there, because it's an internal application), if we go to this internal graph, you see that this is the database side, and I just smudged the name of the data set. But you see here that it flattened. So where we had up to 1.5 seconds, when we're talking about milliseconds, about the need for milliseconds of operations, on the 99th percentile, we now see a total kind of flattening of that line. And this is on the Cassandra side as well, so the application sees a healthier status and the Cassandra itself sees a healthier status.
And this wonderful graph, or multiple graphs: this is from the Zing proof of concept in Taboola, when we started out with Azul. This is again a while back, and we're going to get newer graphs in a moment, but I wanted to bring you back to the kind of beginning of where we started. And you see the nice trend from left to right, the dropping line from left to right. So I'll specifically direct you to the graph on the bottom left side. You see on that graph that you have the green line trending from top left to bottom right, and that is the timeouts for the application: responses that weren't within the time budget that my application needed. And this again is an application view, this is not a Cassandra metric. And you can see how it totally flattened on the bottom right side of the graph.
And the interesting thing here is why... why do we have this trend over time? Because over these few days in December, what we did is we took and upgraded one, two, three or five nodes per day, because again, this was a proof of concept; knowing what we know today, we would just drop it on all of them at once. But this was our proof of concept, and we kind of upgraded our cluster over time. So the more nodes received the Zing JVM, the better the timeouts were for that node, and then on average over the full cluster. And this is a 60-server cluster; think 60 servers in each cluster, times 6 for all of my data centers globally, then this is a whole lot of servers to upgrade, and you can just see that wonderful trend. And then eventually, around the end of December, you see this flat line. And think of this: peak time for advertisers, for publishers, and with all that going on, up into Christmas, the Zing proof of concept was proving itself to be so amazing. You can see on the other graphs the same type of trend: on the top you can see the Cassandra metrics on graph number two, and on graph number three you can see the latencies go down there on the nine-nine-nine. Yes?
[Simon] No, no, I was just going to say I think that's interesting, because you took the approach that I think everybody would take, which is to go, okay, let's just do it gradually, because we don't want to put all our eggs in one basket and suddenly change everything and find that things don't work quite the way that we want them to. But I'm assuming that you didn't find any surprises in terms of functionality changing, so everything just ran exactly the same as when you used the old JVM: switch to Zing and everything's just doing exactly what it was doing before, it just now does it faster and doesn't have the read timeouts and things, which is exactly what you're looking for. And yeah, as I said, you could clearly see that the right approach was to do it gradually, but it's nice to see that sort of gradual approach and then suddenly, boom, you've got everything; it just flatlines at the bottom.
[Ariel] And it flatlines in a good way, yes.
These are of course much newer: we had another internal application that was yet to run Zing, and the team there was going, why not? Guys, you want it, you can have it; test it, see what it does for you. So this is already Zing, I mean, this year's graphs, and this is, as I said, about two weeks of data. And you can see here that this is from a different proof of concept this time, and you can see just the CPU consumption. So you're looking at the same application on two identical servers in terms of hardware, and you see the yellow, or the yellowed-out, line is the Zing server, and the green area is the non-Zing server. And you see really big differences in CPU, only from the change of JVM. Just looking at this, before we go into the rest of the graphs: you can see the effect this has on the amounts of servers, because my server count, or actually, let's call it my IT footprint, has a huge impact on my hosting costs, my running costs, and my ability to serve. Again, go back to those really big numbers: for three billion web pages a day you need a lot of servers, and the fewer servers you have, the better the economics.
[Simon] Yes, I was going to say, I think this is also a very interesting graph from the point of view of showing that when you run garbage collection concurrently with the application, some people think that the problem might be, well, okay, now you're actually placing a heavy load on the system, and so you're going to degrade the amount of throughput that you get with the application, because now you're doing garbage collection work at the same time. And the way we get around that is because we've actually changed the JIT compiler: we use a JIT compiler called Falcon, rather than C2 that you get in the standard OpenJDK software, and that can compensate for the fact that we're doing garbage collection simultaneously and still get the performance. So you're delivering lower CPU usage with that low latency as well.
[Ariel] And the next graph is the timeouts. So there weren't a lot of timeouts, you see, but only one server out of the two has them: you see only one spike, a very small spike here and a very small spike here, but only one server has these timeouts, and that's the non-Zing server. We kept the same color scheme, so if you look, this is the same time frame: we're looking between 22:00 and midnight, and again between 22:00 and midnight, at the kind of peaks of timeouts. And we saw the lower CPU on that server, but the Zing server had no timeouts at all. Looking at the...
[Simon] I'm sorry, I'll just make one comment on that, which is that where you did see one spike there: sometimes there are things that we can't account for, which is the underlying system. There might be some scheduling things or, you know, timeouts that happen around the hardware or something like that; I'm not claiming that that's exactly what it is, but often there are things that we can't compensate for, so we can't always get a totally flat line. Sometimes we see artifacts of the system: by eliminating garbage collection pauses, what you're now seeing is other underlying things.
[Ariel] Absolutely. And if you look at the area under the graph, the total amount of timeouts, one spike here and a tiny spike there compared to constant timeouts, isn't even comparable. So I wouldn't... personally, we didn't dig into it.
And then this here is a bit of a different coloring scheme, because this is a view from the load balancer. So I'm sorry we changed the coloring scheme, which makes it harder to follow, but you see the purple line is the Zing server, and you see again the short kind of spikes of 500s; but if you look at the non-Zing server, you see it's constantly showing 500s. So if you compare these two...
[Simon] Yes, as you can see... when I saw that graph I didn't even realize there was a second line on there.
[Ariel] Yes, it's almost... it's totally flat, apart from these two little spikes here, the two little spikes in the early a.m. and between 22:00 and 23:00. And if you're looking at the red line, again you see that we're constantly reaching over two percent of upstream errors, which means that two percent of requests fail. That, together with the timeouts in the former graph, all of that shows me that this server, which is getting an identical amount of load, is getting me less results: it's manufacturing less recommendations, and that is very costly.
Looking at busy threads, this is another metric, and again it corresponds, same colors, to the Zing and non-Zing servers, and you see the busy threads spike up on the non-Zing server. A nice graph here... I'm sorry, do you want to comment?
[Simon] I'm just thinking that that kind of ties in with exactly what we've seen with the other graphs, so it's a nice proof of the way that Zing works in resolving the issues that you were seeing.
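As an aside, one generic way to sample a "busy threads" style metric from inside any JVM (an assumption about how such a graph could be fed, not how Taboola actually collects it) is to count RUNNABLE threads through the standard ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Counts threads that are currently RUNNABLE, a rough "busy threads" gauge.
public class BusyThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int busy = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                busy++;
            }
        }
        System.out.println("RUNNABLE threads: " + busy + " of " + mx.getThreadCount());
    }
}
```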
[Ariel] Yes. And now, looking at the time at the 99th percentile, and this is the net time that requests are taking, you see here that on the average the Zing server is still providing lower latency, as we saw in the former graphs, but here we also see that it's performing comparably at the 99th percentile. So we're capable of answering requests, and we see that in the total amount of what we can do with this server: we can actually do more. So with this specific application, just by moving, and of course because we were by then familiar and confident with our Zing capabilities, we just replaced it overnight on all the servers and just kicked out 90 servers. So I now have a spare... you know, I have in my pool 90 servers that I can allocate to a different place, and that's 30 percent saved on this specific application, just by moving to Zing.
[Simon] So from a cost point of view, you figured that was an easy sell.
[Ariel] That was an extremely easy sell, because looking at an application, and looking at the server, and looking at what we can do: our ability to provide services with a smaller footprint, our ability to serve our users faster, better, with less errors, all of that is eventually revenue. On the lack of errors and better serving, that's revenue; on the reduction of server footprint, that is revenue, or capital that I'm keeping and not spending. So yes, there is some level of spend on the Azul licensing, but you put all of that together and it totally fits, and there's a proven ROI for this project. And that's why we are putting Zing wherever we just can; it's actually in every application, or I'd call it a framework, within Taboola, where it became a brand name with our developers: they are aware of it. It's not me and IT that need to push this out and say, oh no, a new application, make sure that you're testing it on the correct JVM, make sure that you're running it in production and you're Zingified, or however you would like to call it. It's just grass-rooted.
[Simon] Right, I was going to say, I imagine your CFO is very happy when he looks at the numbers for that kind of thing.
[Ariel] Yes, absolutely.
[Simon] Good, okay. Well, what we'll do, just to wrap up this part of things, is mention a few things around Zing. Obviously we've heard the success story here, and how you've had some really quite impressive results that helped immensely in terms of the data footprint you've got, in the number of machines and so on. And that's what we're really trying to do with Zing: to produce a low latency, high throughput JVM. And again, as you said, it's the simplicity of it being a drop-in replacement: you don't need to recode anything, you don't need to recompile, you don't even need to change your startup scripts. It's a very simple migration from using the old JVM to using Zing. And what we're really focusing on, especially with your type of application loads, is eliminating the latency associated with garbage collection, specifically making sure that your users meet their expectations, and also supporting bigger workloads on the same hardware, which, as we've already explored, reduces costs: you're reducing the provisioning costs and so on.
And what I would say is, if anybody's interested in this and trying it with their own applications, we have a free 30-day trial for Zing: search for "Zing trial" and you can download it. We have people who can help you in terms of setting things up, making sure that everything's running the way you expect. Even though it's a drop-in replacement, obviously one of the things we help customers with is setting up the way of measuring the performance. And so we've got some nice tools that help people to understand exactly what their latency looked like before using Zing, and then, using Zing with their application, we can show the effects on latency for the JVM. It's quite important to do that, because although obviously you're looking at application level metrics, what are your users getting, it's also important to understand how the JVM behaves in terms of the application interacting with the JVM. So we've got tools that can help with that and produce some nice graphs, again, that you can show to your CFO and say, look, this is the result we get.
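The best known of those tools is probably jHiccup, Azul's open-source pause-measurement utility. The idea behind it fits in a few lines of plain Java; this toy version is ours, for illustration (use the real tool for actual measurement): a thread asks to sleep for a fixed interval and records how much longer the sleep really took, because any extra delay was also felt by the application threads.

```java
// Minimal sketch of the jHiccup idea: measure how late a sleeping thread
// wakes up. Large "hiccups" usually indicate JVM-wide pauses such as GC.
public class HiccupMeter {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 10;
        long maxHiccupMs = 0;
        long end = System.currentTimeMillis() + 60_000; // sample for one minute
        while (System.currentTimeMillis() < end) {
            long before = System.nanoTime();
            Thread.sleep(intervalMs);
            long actualMs = (System.nanoTime() - before) / 1_000_000;
            long hiccup = actualMs - intervalMs;
            if (hiccup > maxHiccupMs) {
                maxHiccupMs = hiccup;
                System.out.println("new max hiccup: " + hiccup + " ms");
            }
        }
        System.out.println("worst hiccup over the run: " + maxHiccupMs + " ms");
    }
}
```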
So that's pretty much the end of the slides and the presentation part, so I guess what we'll do now is see whether anybody has any questions. So if we go to the questions... hopefully there'll be some questions.
[Ariel] I can share how we started out. Do you have any questions, or do you want me to go ahead?
[Simon] Oh no, if you share how you started out, then we'll see if anybody has any questions.
[Ariel] Yes, happily.
So this was over three years ago, when we started out. We actually have... we had back then a few monolithic applications, and we started with a one-terabyte-heap application that ran on a single server, and it had this huge heap, and that was our kind of first test. It had a 15-minute cycle of garbage collection, where the server would just stop for 15 full minutes and do the garbage collection, which is just the way Java works, and we would accept that, because it was a back-end application; it was our billing application. And once we were able to optimize that, and we suddenly saw that we're getting this flat line there, that was one of the first aha moments.
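If you want a first, rough look at how much of your own runtime disappears into collections like the ones Ariel describes, the standard JMX beans report cumulative GC counts and times on any JVM; a minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Standard JMX, works on any JVM: cumulative GC counts and time so far.
public class GcReport {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s collections=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Keep in mind that getCollectionTime() is total time attributed to collection, which for concurrent collectors is not the same as application pause time; pause-oriented measurement (GC logs, or a hiccup meter as sketched earlier) is a better gauge of what users feel.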
But I do want to say something about Cassandra that I didn't have a chance to mention: with Zing we are able to do something that is just unattainable otherwise. We reduced our node count and created extremely dense nodes over our hardware. So we're using the same hardware, but we are now able to put multiple terabytes on a single Cassandra node, which is usually not recommended, actually; if you go to the kind of best practices, it's up to one terabyte, yeah, up to one terabyte of physical storage per Cassandra node. But we are way beyond that, and without Zing that would be unattainable, due to latency, due to heap issues, due to the actual ability of the server to answer quickly. So there was not only the ability to answer faster, but the ability to condense our cluster into more dense nodes. And now I see we have multiple questions.
[Simon] I was going to say, we've got some questions now. So the first one is: what GC algorithm were you using, or were your applications using, before you moved? Do you know? I would assume you were on G1.
[Ariel] Exactly, we were using G1. And again, this was four years ago, and I think we played around with a few others, but G1 was the only one that, at scale at least, was able to cope with what we were doing. And then we moved on to Zing once we saw the light.
[Simon] Right. Second question: could you talk about tuning the non-Zing JVM compared to Zing? So that's kind of an interesting question: did you spend a lot of time trying to performance tune the old JVM?
[Ariel] This was years ago, and... we did on the monolithic application, less on the front-end servers. On the front-end servers we were going like, we don't have a problem; you know, these small freezes, you don't even really grasp what's happening there until you see the difference. But yes, we tried different approaches: we tried to reduce our heap size, we tried to increase our heap size, we tried to change our cache strategies, we tried multiple different things in terms of how we could optimize our JVMs. And actually, even since moving to Zing, we've been working with Azul on our performance, continuously, beyond just the default improvements. We've been making additional improvements on many fronts: some of them around AVX-512, some of them around CPU speed, some of them around CPU pinning, some of them around NUMA, all ways that we've been tweaking. But by far the easiest and fastest and... the word "lucrative" isn't correct here, but the most beneficial, I think that's the best word to use here, for us in terms of cost performance, was when we moved to Zing and got all these improvements without the effort behind it. It was so simple, compared to having system optimizations drive application optimizations. That is the thing that made the most difference.
[Simon] And sort of related to that, the question is: what tools did you use to monitor all of this?
[Ariel] Okay, so our monitoring... we don't monitor Zing directly, we monitor the whole system, and there's a whole monitoring framework in there. So for observability, the database is Prometheus (it was a metric database), and the log shipping, the metric shipping, was based... is based now on Kafka; I'm trying to think what it was based on back then... with the local agent there. There's a whole lot of different tools, there was not one tool that we used, but all the graphs you saw here are our Grafana graphs, the database now is, as I said, the Prometheus database, and we're pulling the metrics with the local agent. So that's, I guess, the easiest and shortest answer; if anyone wants to reach out on LinkedIn, I can provide a much, much longer answer.
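As a sketch of the kind of setup Ariel describes, and assuming the standard Prometheus Java client (simpleclient, with its hotspot and httpserver modules) rather than Taboola's actual agent, exposing JVM metrics for Prometheus to scrape and Grafana to graph can be as small as:

```java
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

// Assumption for illustration: the standard Prometheus Java client, not
// Taboola's agent. Exposes built-in JVM metrics (GC, memory, threads) over
// HTTP for a Prometheus server to scrape.
public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        DefaultExports.initialize();              // register JVM collectors
        HTTPServer server = new HTTPServer(9091); // http://host:9091/metrics
        System.out.println("Serving Prometheus metrics on :9091");
        Thread.currentThread().join();            // keep the exporter alive
    }
}
```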
[Simon] Okay. Another question here is: you spoke about latency flattening; have you ultimately managed to reduce costs by migrating to Zing?
[Ariel] Yes, so I spoke about that briefly; I can elaborate here a bit more. We saw great cost savings, just by moving, in a few places. First, our ability to reduce the server count: we reduce the size of our clusters, and that's our actual front-facing application. As I said earlier, four years ago we would have one percent or two percent of our requests timing out, or just working slower for part of the user population; now we're answering much faster. I think it's almost common knowledge now that the faster you are online, the more chances you have of getting that piece of content onto the user's device and actually having the ability to engage, because if you're too slow, the user just moves on. So it's been improving our serving, reducing our IT footprint, reducing our errors, delaying the need for new hardware; all of that brought us to the relevant cost savings.
[Simon] Yeah, so I think there's two things there. There's obviously the ones you can easily look at, which is you can measure how many servers you saved, how much you spent; but then, as you said, there's the less visible cost, or the benefit, of being able to serve your customers quickly and therefore get more business, if you like, through that. There's another question here which I think we answered, or you answered, at the beginning, which is: how large is the server estate? And I think you did say how many servers you have.
[Ariel] Yes, so we have 8,500 servers; that's 8,500, and we actually have a bit more now, that's not a new slide. So we have a whole lot of servers; not all of them are our Java or Zing servers, but we're at thousands of servers. Some of these servers are HDFS, which doesn't benefit from Zing, or we have other technologies in Taboola, such as Vertica, which again is not a Java based application. But wherever we are Java, by now I don't think there's a corner of Taboola where we have a Java application and we don't have Zing beneath it, powering it up.
[Simon] Good, good to know.
[Simon] Right. And then I've got one more: we tuned JVMs based on load; how did you handle batch versus online for Zing? Does that make sense?
[Ariel] Very interesting. We have our front-end applications (all the graphs I showed were for our front-end applications), and I only spoke at the end now about our batch side, the big monolithic one-terabyte application. So on front-end applications it's simple: you see you have 20 servers, or 60, or in our case hundreds of servers (if you're on AWS, of course, then instances), but anyone, with whatever setup, you can just reduce that footprint. It's an easy way to look at it, because you get more transactions per server, more transactions per second, or you have less errors, whatever the relevant metric for you is: sheer volume, CPU load, whatever.
On the back-end applications, what actually was the selling point... I'm smiling because now I'm recalling that feeling, I'll tell you about it. So we had this back-end application, as I said, which was our billing application, and it would crunch all the data coming in. So we have the single server; it's now on Spark, it's now totally different, with a totally different technology, but back then it was this big monolithic Java application, and we have all the billing data coming in, all the lines of logs coming in, and it would pick them up from the drive and then crunch, crunch, crunch, and provide a status of where our system is today in terms of billing. And think that it had to crunch billions of lines. Now, it would pause on GC every 15 minutes, and... just think of the boot time of this thing if, let's say, you had to put a new version on it. And so you see the line going down, but then suddenly, GC. Oh, okay, now you're waiting; it's going up, you're waiting... it finished the GC after 15 minutes, and then it's processing again, and you're trying to see: is it catching up? And I remember days upon days just looking at the latency of this server, how much is it taking, and every time we had a new version we had to deploy on it. I mean, who remembers deployments anymore? We're on continuous deployment now; we're not there, we don't live there anymore. But back then this was a huge thing. And this sudden flattening of the line, no more "hold the world, I want to just garbage collect", and that kind of feeling of... I can just see what the server is doing. This really... this feeling of relief. So I know that's not a number, I'm sorry, it's not a number, but it was a single server, so luckily enough, when we did the licensing, it was not a big deal in terms of licensing, and just for that feeling... I mean, it was IT administrator appreciation day last Friday, so feel it for me, man, feel it for me, whoever the question came from.
[Simon] Yeah, that's excellent. I'm always happy with happy customers, even if it is just that it made your life much simpler. That seems to be all the questions we have, and we're at about 40 minutes, so that's pretty good. So I think what we'll do is... oh no, one more comment: "We're in the same boat now; it's challenging to convince the management on how much cost savings to bring to organizations. Thanks for sharing this information; we're working 24/7." Great, so hopefully we've got another person there that's looking at Zing, will try the trial, and will hopefully get the same benefits that you've had. So, just to wrap up, as I said at the beginning, we have recorded this session and we'll be sending a link to the recording, so you can share it with family and friends, and we will also send you a copy of the slides. And I would like to say a very, very big thank you to you, Ariel, for all of the information you shared with us; like I say, it's happy customers that make Azul. So with that, thank you to everybody for attending the webinar.