"Is our app sticky? The key feature of the FinTech app where you work is the ability for users to make loans to friends – it's also how you make money. If they stop doing it over time that's a problem!
Bank of Yu Offices – Late at night
Thanks again for staying late\nI have a sensitive project we need to work on\nDespite all the good work we've done to bring in new users, I'm worried they're not sticking around\nAfter they make their first loans, are they making more?\nThat's the important thing for us to crack\nBut of course I don't want to alarm anyone that we're looking into it\nI'd appreciate if you could run a cohort analysis for me\nShare the results directly with me, no one else
One of our firm's favorite measurements is the cohort analysis
Cohort Analysis A cohort analysis is a form of longitudinal study used in medicine, social science and marketing. A cohort is a group of users who share something in common. You could group users by their sign-up date, first purchase month, or acquisition channel. By tracking these groups over time, you can spot trends, and understand repeat customer behavior. \nThis type of analysis is essential for discovering retention trends because looking at aggregate metrics you can't tell if users are coming back. For example you could have a growing total user count, but that might be bolstered by the number of new users joining. When you group users into cohorts based on the month they signed up, you can more easily notice that a large percentage of them are dropping off.\nCohort analysis requires a cohort period, be it day, week or month, which you use as the X axis when visualizing it. If you're using monthly cohorts, a cohort period of 2 would refer to the second month they were active, 3 would be the 3rd month and so on. This data can either be visualized as a line chart, with one line per cohort and the cohort period on the x axis, or as a heatmap. Heatmaps shade each box representing a cohort and cohort period by the percentage of users who did the relevant action and were retained. Dark or light spots can indicate peaks or dips in retention activity, which makes it easy to scan and highlight anomalies.
When conducting a cohort analysis it's important to choose a behavior to analyze that is supposed to be repeated regularly. For example if you're looking at daily cohorts, is the action something you expect users to do daily?
The following code is a Google Colab notebook which contains the script necessary to run a Cohort Analysis. Make a copy then review the code and run each cell to see what it does. There is an accompanying video underneath the link to the notebook that walks you through it. You will be expected to answer questions about the results of the analysis afterwards.
Cohort Analysis I ain't going to walk you through how this whole works. So the first section here is just importing all the libraries and you just want to hit shift, enter where you can hit the play button hit, and it's going to connect you to Google CoLab to the backend. And then I should say connected and then finish running and you shouldn't have any issues there.\nThen you want to authenticate. So. Let's see if it asks me. Yeah. Okay. So you're going to click through and then choose one of your emails and then just ask you to click allow, and then. Some token to paste in, and then once you enter that, then you'll be authenticated and that's going to let you pull data directly from Google sheets.\nSo we run this next cell. I mean, you have, it's your shift and to then that's going to pull in the data from this spreadsheet here which is all of the loan data for you know, for this. When, when the loan was originated then what the ID was the user add that loan theme out. And then the amount they learned as well as the settlement.\nWhich we're not gonna use. And in order to create a cohort analysis, we need to do a few things. First thing we need to do is create a loan period because we're going to look at the month and year that they learned out the money assets, 2021 here, as it's basically coming off this 11. And that that's just using strip time, which is built into pandas.\nThen we're going to get the user ID and kind of make it a use that to kind of find the earliest loan period. So that's what this group buys is doing. Our first reset, the user ID is to index and then we grouped by user ID, which is level zero. And once you set it as the index, and then we find the minimum loan period for that user and then we can just you know, reset the index and, and then we have So that gives us the cohort group.\nI said, you can see here this user it's all the same user 47. They have a learn period, 20, 21 and a cohort group, 20 20 0 1. And then actually, if we just kind of run this, oh, actually we would just pull up the data. If you've got to run these in term then you'll end up getting errors. I like, I just got.\nCool. So let's follow that. She's a down, so 47. Okay. In the second month then you know, they still have the cohort group of 20 21. So that's doing exactly what we intended. Okay. The other thing we need to do, this is a common problem. We just need to replace the dollar sign in the loan amount because otherwise it's not going to aggregate it.\nCause it's, I thinks it's a string should be a number. And we're going to make that a flute that you can see here that now the D type is float. So you just do that. And then this part is where it gets a bit more interesting. So this is again using a group by, but this time we're grouping by cohort group and London.\nI said that's this here. And then and then we're aggregating the data as we're aggregating user ID, loan ID and loan amount. And then we're just renaming them here to total users, total ones. So what this gives you is, you know, for this cohort what, you know, dates they were present in, and then how many total users they have of those dates, as well as how many loans those users made in those dates and the amount of those.\nThen this function here basically just takes the loan cohort period. And it gives you a number. So this is, you know, cohort period 1, 2, 3, 4, 5, as this is. So we don't have to worry about the dates so we can just show you know all of the users you know, that their first cohort, the second cohort, the third cohort cohort.\nSo that. Connect you know, different cohorts together at to see how they trend over time. And you'll, that will be clear in the second. Then there's just a few more things to do before we can see the final the final result. We just need to reset the index and then set it as a, the cohort group, a group of COVID.\nAnd then we create this group hood cohort group size. So that's a group by these two things. And what that does is kind of give you this, which you use down here to divide in order to get percentages. So don't worry too much if you don't understand that. But basically what it's doing is giving the total number of users for each cohort group.\nAnd then you can basically divide. That percentage in order to get this. Okay. So now this is unstacking. So this is kind of like the opposite of pivoting and basically then dividing in order to get this here. So let me just run these parts that we haven't run yet. So we did this.\nNope. Okay. Let's go back here. Yeah. And then. Oh, we've already done this as middle of the science in there anymore. So don't worry about that\nor that amount. And then this is the cohort period. So you can see here 1, 2, 3, 4, 5 and then this is creating that group and now this should run. Okay, cool. So what this is doing is now giving us a percentages of. This whole period. So you can see coho period one you know, what was the cohort group?\nSo you can basically see you know, this type of visualization. Which is you know, by cohort group what percentage of users were still around after like one month, two months, three months, et cetera. So in the first month it's always a hundred percent and then you can see it. You know, the cohort that started in February you know, less of them were around than the cohort that started in January.\nSo that's indication that the, you know, whatever we did in February or there's uses required in February one, as good as. You know, the ones from January and to see all of this together you can do a heat map and the heat map basically shows all of this stuff. But in percentages you know, in a, kind of an easy to read format so we can see it's always a hundred percent in the first cohort period.\nBut then everyone in January we have 36% of people who, who made another loan in February 45% in March. So it can actually go up as well as down, you can see here, this kind of a normally where you know, people are starting in January after 13 months, suddenly 50% came back and originated a loan.\nSo that's interesting to look at and broadly speaking, you can kind of use the heat. These colors in order to see what's important. So you can see that in every, you know, when something is dark, that's bad you know, we can see a lot of the cohorts you know down to like five, 10, 8% you know, after just a few months so that's kind of troubling.\nBut then, you know, some cohorts we've got pretty strong retention here and the reason it goes diagonal is because. You know, for people who joined in February 20, 21, we only have two months worth of data. So so that's that's why you look at a cohort analysis in this way, and that's why it's always kind of a diagonal.
What makes a heatmap ideal for visualizing cohort analysis?
Is retention improving or getting worse?
What would be a good metric to use in a weekly cohort analysis?
Google how to average the columns of a pandas dataframe. How would you do that to get the mean retention per cohort period?
What was the average retention of users in the 6th cohort period?
Yikes, ok that's not as promising as I hoped\nLooks like we need to see what we did right in January and try and replicate it\nThanks for doing this, send me the files and I'll present to the CEO\nI'll be sure to give you all the credit!
"