Kiki (K): Hi everyone, welcome to Kartini Teknologi with Kiki and Galuh. Before we start chatting with our guest, we want to congratulate Galuh who has just become a Google Developer Expert in Machine Learning. Congrats!
Galuh (G): Thank you!
K: So cool. Anyway, talking about data, we have Theresia Tanzil with us. She’s a solution architect at ScrapingHub. Do you want to introduce yourself a little bit?
Tere (T): Sure. Hi everyone, hi Kiki, hi Galuh. I’m Theresia Tanzil but you can call me Tere. Currently I work at ScrapingHub, a web scraping company, so our products range from tools to custom development services for web scraping. I’m in my 6th year at ScrapingHub. A solution architect is more of a pre-sales role, so technical sales. The point is, people can tell ScrapingHub what kind of data they want to collect, and then we assess whether the scope is feasible or not and give them the estimation, quotation, stuff like that. My background is… I took IT at Binus and graduated in 2007.
K: That’s a while ago. Can you tell us how you got interested in technology? Have you always had this “I want to be a programmer no matter what” feeling, or how did it happen?
T: It’s kind of a funny story now that I’m in the tech field. When I was in junior high, I hated the ICT subject because the lab was so far, and it was so boring. That was in 97, so you can tell how old I am. The lessons were boring, and I didn’t understand what word processors were for, WordStar, Lotus… boring. In the third grade of junior high I learned PowerPoint, Word, and so on. But since I didn’t have a computer at home, I couldn’t practice it, so… the lessons were not fun. By the end of 2000 I got a PC. I could lobby my parents to buy me a PC. I started tinkering and got introduced to the Internet. That’s when I saw that wow… there was so much information out there, so it made it more fun, and I started learning how to build websites with HTML. We didn’t have CSS back then. There were marquees, FrontPage, What You See is What You Get editor. During high school I started tinkering, I began to like computers, bought CDs in Mangga Dua containing underground zines on cracking, hacking, carding and stuff. When I finished high school, I thought, what major should I take? My main interest was psychology and language. But I thought I could learn those by myself. I could learn psychology for fun, and the most practical option seemed to be IT. So I majored in IT at Binus. There were options like computer system, IT, information system, informatics management, computer accountancy. I picked IT because in computer system I had to learn physics and I sucked at it. Information system seems to require lots of memorization and it seemed boring. Accountancy was like… I wasn’t interested, so I took IT. So that’s how. I was already interested since high school, maybe it’s because I love learning, so when I met the Internet I was like… wow, the world is… there is so much information out there. When I graduated, I started seeking for IT jobs, but what’s funny is my first job wasn’t in IT. My first job is as a news writer. 
So my first company actually opened two job vacancies, first one was news writer, and second one was IT staff. I applied for both, and they called me. First they called me for the news writer role… err, no, the IT one first. When I was interviewed, they asked me, did you apply for a news writer role as well? Maybe they managed the HRD’s email accounts so they knew the applications that came in. So they asked, did you apply for a news writer role? If you can choose, which one would you choose? So they asked me to pick one. I thought, okay, I already have a diploma in IT. If I want to get back to the field I can do it later. I said, I want to try the news writer role. I passed the tests and I learned news writing. But at the end I went back to IT because I think news writing can be a hobby. So I continued until now.
G: I’m curious, I’m interested when you mentioned during high school you bought CDs from Mangga Dua and learned hacking from there. Did you tinker by yourself or were there communities on those topics that you joined so you could learn together as well?
T: I learned a lot by reading, but yes we had Yahoo! Groups back then. And in Yahoo! Groups there were many communities but they were like mailing lists, right… and my English was still terrible. So I mostly just read, I didn’t actively participate, didn’t ask questions. This was before StackOverflow and stuff. Obviously I’m standing on the shoulders of giants but I’m not like… you know, active in the discussion and stuff, more like practicing on my own, trial-and-error on my own.
G: I see, so before StackOverflow we had Yahoo! Groups…
T: Even before Google apparently. If I think about it I did search for things in Yahoo! Search. I feel so old now.
G: But it’s true, the first time I used the Internet the first thing I opened was Yahoo!. Google… wasn’t as well known as Yahoo!. I also played games in Yahoo! Games. And Yahoo! Mail for email.
T: ICQ, oh my God, mIRC.
K: We also went through those, don’t worry.
T: Oh really, well, thank God.
T: Mostly Python, and well not just mostly, more like exclusively Python because in ScrapingHub we… if you know Scrapy for web crawler, ScrapingHub is the founder. So it started as an open-source project, and then they made a company out of it, and it’s mainly Scrapy, so our stack is mostly Python.
K: So in the company the focus is on Python. Cool, okay. You already have a lot of experience. Is there any project that is the most memorable one for you, or one that makes you feel like, “I’m proud I have worked on this”?
T: I am most proud of… I do have one that is the most memorable. I’m proud of it because it’s so memorable. The story is also quite funny. Now I’m more involved in data engineering at the current company, but I started by trying everything, from desktop to web, then I finally chose the web and focused on it. Within the web, I tried front-end and backend, and it turned out I preferred backend, then data engineering. During the transition from backend to data engineering, I worked for a media monitoring company. We monitored all media at that time: print media such as newspapers and magazines, online media, radio, and TV. You could say I handled everything there, from operations, building the product, and building the CMS (Content Management System), to troubleshooting, for example if the Internet died and so on. And I worked remotely too at that time. I went to the office once a month, if necessary. The most memorable part was the CMS and the web client. At that time we wanted to migrate from the old CMS to a new CMS. The features were going to be upgraded, there was a lot of technical debt and stuff. Then I said, okay, it seems like most of the features are already implemented in this new CMS. We will roll it out, I think we will do an alpha release first for testing. We sent an email a few days before, saying we would start testing the new CMS, and so on. Well, unfortunately, suppose that it was the 28th of February when I sent the email, then on the 1st somehow the old CMS had a bug. So when the operational team logged in on the 1st, they saw that the CMS was empty. Media monitoring has targets that have to be reached, so we had to track things right from when the paper was published, and by dawn someone had to upload it. So at 5 in the morning the people in the operational team logged in. They saw that the old CMS was empty, so they thought they had to use the new CMS. 
They uploaded it directly to the new CMS, even though the new CMS was not ready. Then I woke up, and at 8 in the morning there were a lot of messages in YM (yes, we were still using YM at that time, there was no Slack or anything like it). Then I was like, what’s going on? There were phone calls, missed calls and so on. Why is the content missing from the old CMS, and why does the new CMS have content? Even though the features weren’t all finished yet, I ended up having to implement them within a week or two. These were the craziest weeks of my career, where I had to implement all the features that were still missing in the new CMS while making sure that all push mails and all operations could still work, and… yes, there were gaps as well. It made me crazy. I thought I went gray in a week, but it was really memorable because I felt that I had overcome that, so whatever happens in the future, it seems like it’s nothing compared to what I went through. So you could say it’s one of the most memorable projects.
K: Because it’s full of challenges, right, so you surely learned a lot within that week didn’t you?
T: Learning a lot in terms of… actually, technically I didn’t learn new things, I didn’t learn a new stack. But it turned out that what I thought was my full capability was only 40%, and I could do much more than that. It was like being superhuman. Maybe I can share one more silly thing that’s memorable though. I dropped a production database. And it was in the same company. And what’s ridiculous was there was no proper backup. Well, there was one of sorts: there happened to be a server migration plan, and coincidentally the web host had made a snapshot before we migrated. But that snapshot was already two months old. So, you can imagine, right? I learned a lot about the non-technical things, how we can recover from that kind of mess.
K: But … that means, do you believe that we can learn more from failures like that?
T: Yes, a lot.
K: We often learn from mistakes like that.
G: There’s no way to forget them once they happened.
K: That’s why it’s memorable this time.
T: It’s memorable, and it makes for a good story. Because when I read about it again I was like, oh, it turns out there are also many other people dropping their production databases. I learned things like, oh, it turns out that backups are important, oh, it turns out that my manager is very kind…
M: About production database, maybe the solution is you need to have a backup. For the first case, let’s say that it happens now with the existing technology that we have. What will you do? Will you do the same thing or… maybe with the existing technologies maybe you can do things differently, or what?
T: Hmm. It looks like I couldn’t change anything for the first case because it was purely human error. People made the wrong assumptions and that was it. Actually, if I had wanted to, I could have made them go back to the old one. But I decided not to, the show must go on. Anyway, I was sure 80% of the new CMS was already done, so I thought rather than going back and forth, let’s bring it on. I think new technology wouldn’t solve it. But it’s an interesting question, there are things that we couldn’t do then but we can now. If I think of anything, I’ll share.
M: OK, OK. You worked on a project for the UN. And you combined lots of different data from different sources. So, what’s the story?
T: Maybe I’ll start with the background, how I came to join the project. So it was the UN and UNDP… Pulse Lab actually, Pulse Lab Jakarta. It’s a research lab between UNDP and Bappenas. The main objective of the lab is to help Bappenas look at alternative data sources. Bappenas works with conventional data but is quite progressive, so apparently Bappenas thought, oh, we have to leverage big data, right? Big data is… in a sense, an alternative. It does feel like the term big data is a bit overhyped, like what’s big data anyway, so I might just say alternative data sources here: tweets, forum data, conversations on the Internet, data that had never previously been considered for overlaying. What we did at that time was that we had a subscription to the API from Twitter and others. At that time there was still datasyncinc [dot]com. I’m not too sure if it still exists, was it bought by Twitter and is it now part of the enterprise API? Anyway, at that time we used it for the data from Twitter. It went mission by mission, well, not missions, research projects. One of the research projects conducted at that time was overlaying basic food prices. Bappenas had time series data on basic food price movements, and we overlaid it with Twitter data. People tweet about the price of chillies, the price of rice and others; the tweets carry a time, and from the location we can also map them. Actually at that time the combination was pretty simple, just those two datasets. The libraries were standard: d3 for visualization, and Python for data wrangling, cleansing and so on. So technically it’s not too sexy, meaning that the stack isn’t actually complicated, it’s just the concept. In that project, I learned more about communication and the big picture between agencies and industries. I wasn’t really learning technical stuff at that time.
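To make the overlay idea concrete, here is a minimal Python sketch of joining a daily price series with daily counts of tweets that mention a commodity. This is not the project's code: the dates, prices, tweet texts, and the `overlay` helper are all hypothetical, and it uses only the standard library.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data: daily chilli prices (rupiah per kg) and tweets
# mentioning the commodity. None of these values come from the real project.
prices = {
    date(2014, 1, 1): 30000,
    date(2014, 1, 2): 32000,
    date(2014, 1, 3): 35000,
}
tweets = [
    {"date": date(2014, 1, 2), "text": "harga cabai naik lagi"},
    {"date": date(2014, 1, 3), "text": "cabai mahal banget"},
    {"date": date(2014, 1, 3), "text": "Harga CABAI gila"},
]


def overlay(prices, tweets, keyword):
    """Join the price series with the daily count of tweets containing keyword."""
    counts = Counter(
        t["date"] for t in tweets if keyword in t["text"].lower()
    )
    # One row per day in the price series; days with no tweets get a count of 0.
    return [
        {"date": day, "price": price, "tweet_count": counts.get(day, 0)}
        for day, price in sorted(prices.items())
    ]


rows = overlay(prices, tweets, "cabai")
for row in rows:
    print(row["date"].isoformat(), row["price"], row["tweet_count"])
```

Rows shaped like this (one record per day with both measures) are the kind of static, pre-joined data a dashboard chart could plot as two lines on a shared time axis.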
K: How long did it take you to work on it?
T: The output was actually… we wanted to make a dashboard where we visualize the trends, the correlations and everything else. So the deliverables were very specific. If I am not mistaken, the process consisted of gathering the data and then meetings. There were really many meetings at that time, because you know, UNDP and Bappenas, you can imagine what the bureaucracy was like, liaising and other things, until it finally became a microsite. I think it took me like a month or less. Because the data was not dynamic, it was just static data. We overlaid our static data and made the dashboard, so we weren’t reading stream data.
G: You have worked with data for a long time. Do you think there is a difference in the data that you processed let’s say a few years ago, compared to the data you’re working with now? Maybe today’s data is more complex or more unstructured or the source is more diverse?
T: Hmm… it seems like the character of the data hasn’t changed much, and the volume hasn’t changed much either… I don’t really notice it… well, I guess it really depends on the context. But I think what has changed most is the technology to process the data itself. In the beginning there was no MongoDB, and other tools were more difficult to use, right? For example when you were processing unstructured data. The data that we got from the media monitoring project was stored in MySQL. Imagine text processing (with MySQL). Now there is ElasticSearch, so querying is easier. We also have Solr. Well, we had Solr back then, but anyway. So it’s more about the tools. If we wanted to process batch data, we only had Hadoop, there were no other options. But once there was Apache Storm, it turned out that, oh, it’s easy, the data can be streamed and you can make calculations in almost real time, and now there’s lots more. So I see more changes there. Besides that, what I’ve personally become more aware of is this: when I worked on the media monitoring project I played with media data, which is articles, news, audio, video, but at ScrapingHub I saw more use cases, like all the public data out there… When we browse public websites, we’re just users, and at that time I was like, okay, all right, but I did not see that there is value that can be taken from there. But once I was at ScrapingHub I saw that, oh, people would pay to collect this data. They can use it for analysis or market research, or to train their models, and build businesses on top of this public data. So I learned that, oh, it turns out that data is very interesting. When I was in media monitoring I already felt that data was interesting, but from the business perspective, it’s really unlimited. So maybe those are the changes.
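The batch versus streaming distinction can be illustrated with a small Python sketch. It does not use Hadoop or Apache Storm; it is only a plain-Python illustration, and the function names and sample readings are made up. The batch version needs the whole dataset before it can answer, while the streaming version updates its answer after every incoming value.

```python
def batch_mean(values):
    """Batch style: wait for the whole dataset, then compute once."""
    values = list(values)
    return sum(values) / len(values)


def streaming_mean(stream):
    """Streaming style: yield an updated mean after every incoming value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no need to keep earlier values in memory.
        mean += (value - mean) / count
        yield mean


readings = [4, 8, 6, 2]  # hypothetical measurements arriving one by one
running = list(streaming_mean(readings))
print(running)               # mean after each arrival
print(batch_mean(readings))  # single answer once everything has arrived
```

The last value produced by the streaming version matches the batch result, but the intermediate values are available as soon as each reading arrives, which is what makes near-real-time calculations possible.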
G: I see, so to summarize, first the tools, then the second might be user awareness, clients are also becoming more aware that oh apparently I can get value from this kind of data and that’s why they request ScrapingHub to get the data.
K: Let’s shift to Tere’s current career at ScrapingHub because we already touched on that a little bit. Can you tell us in more detail about ScrapingHub, what it does and what the business model is like, if you can share that?
T: Maybe we can start with the product first. So what is web scraping actually? If you’re not in the data field, maybe you’re like, what’s web scraping, what is getting scraped exactly? It’s actually a subset of the data field. Think of the data universe. Data engineering covers engineering, processing, and collection. Web scraping is part of data collection. Data collection means that we provide an ecosystem for collecting data. We call the scrapers spiders, basically like Google’s spiders crawling the web, so the products range from services for building spiders, to platforms for hosting spiders, proxy managers and others. When I entered ScrapingHub I was a crawl engineer. Actually, I was a Python developer at that time, but the Python developer role was very specific to making spiders. So now we use the term crawl engineer, not crawling as in crawling but… you know, crawling the web. I joined in 2015 and kept making spiders, blah blah blah, and at that time there was no solution architect team. The solution architect team is actually there to help the sales team. Because in the past, when sales made the estimates, you can imagine, right? The project might actually be complicated but they would say it could be finished in two days. So the company needed this role to look at the requirements and give a reasonable quotation for a project like that. Well, so I did it, continued to do solution architecting, and from 2018 up until now we have scaled up the team. Now there are seven people in the team. 
My career has gone from implementation, actually writing the spiders, to non-technical work, because when making estimates we don’t need to implement the spider, right? We just need to learn how the website works, what data they need, how to do the crawling, what the discovery and crawling logic looks like, what parts must be extracted, and whether the data is unstructured or not. If the data is unstructured it becomes more complicated, it might take us two or three days and stuff. Since the team has scaled up, I’m increasingly switching to managing the team. So that’s the high-level explanation.
K: It’s more like a bridge between the sales team and technical insights, right?
T: It’s like helping… and the role is very customer-facing, so it’s like gathering requirements from customers, pre-sales.
K: What do you usually need to consider before finally sending the quotation?
T: Hmm… because our quotation is based on complexity, we have to really understand what the data is… and first, actually, the business goal. What does this client need the data for? Because sometimes we can suggest new data sources or other data sources that are more suitable for answering that business goal. For example, they come and say “I need a database of all dentists in America”, and their request is “please crawl Yellow Pages”. Well, we may see that, oh, actually besides the Yellow Pages there is something more suitable. We must understand the business goals, so we aren’t only following instructions. We are advising them on what best fits their business needs. What things are being assessed? First, are you sure you need the data for the whole United States? Or do you just need the data for California? If they really need the whole United States, the way of crawling the web is different from if they only need California. Do we have to input a zipcode on the website, and so on. So first is the crawling strategy, and second is the extraction strategy. Once we arrive at the list of dentists, how do we get all the dentists? Do we have to follow the pagination… maybe this is too much into the logic of web scraping, but the point is that we must consider the complexity of getting the data itself and how the collected data will be consumed. The data that we collect can be delivered as CSV or as JSON or in a special format that the client requested. But the client can also ask “please insert it into my database”. It’s all part of the complexity of the project. Oh, and crawling often goes hand in hand with bans, right? It means that if the crawl is too aggressive, the website can ban the bot and stuff. So we also have to consider speed: within how many days do you need this database? If for example they say, “I need it fast, in a week”, it means we have to know whether this is too aggressive or not. 
And what kind of proxy rotator we have to apply. Maybe this is too detailed, but it’s more or less like that.
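The crawling strategy, extraction strategy, and delivery format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ScrapingHub's implementation and not Scrapy code: the "website" is a hypothetical in-memory dictionary (`FAKE_SITE`), and the URLs, field names, and helper functions are invented. A real spider would fetch pages over HTTP and would also have to handle crawl speed and proxies.

```python
import csv
import io
import json

# Hypothetical in-memory "website": each listing page holds some records and
# an optional link to the next page. A real spider would fetch these over HTTP.
FAKE_SITE = {
    "/dentists?page=1": {
        "items": [{"name": "Dr. A", "zipcode": "90001", "ad": "ignored"}],
        "next": "/dentists?page=2",
    },
    "/dentists?page=2": {
        "items": [{"name": "Dr. B", "zipcode": "94105", "ad": "ignored"}],
        "next": None,
    },
}

FIELDS = ["name", "zipcode"]  # the fields the (hypothetical) client asked for


def crawl(start_url, site):
    """Crawling strategy: follow 'next' links until the pagination ends."""
    url = start_url
    seen = set()
    while url and url not in seen:  # the seen set guards against link loops
        seen.add(url)
        page = site[url]
        # Extraction strategy: keep only the requested fields of each record.
        for item in page["items"]:
            yield {field: item[field] for field in FIELDS}
        url = page["next"]


def to_csv(records):
    """Delivery: serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = list(crawl("/dentists?page=1", FAKE_SITE))
as_json = json.dumps(records)  # delivery as JSON
as_csv = to_csv(records)       # delivery as CSV
print(as_json)
print(as_csv)
```

Even in this toy version the three cost drivers are separate pieces: how pages are discovered, what is pulled out of each page, and how the result is handed over. Changing any one of them (for example, inserting into a client database instead of writing CSV) changes the effort without touching the others.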
K: The thing is, clients sometimes don’t really understand what they want, so the solution architect’s job is to translate their goals into technical requirements.
T: Yes, that’s true. Sometimes I get extraordinary requests, for example “I want to subscribe, crawl all Amazon data, all departments, I want to refresh the data every day”. Then we are like, okay… do you know how big… what are you trying to do, actually? We can do it, it’s technically possible, but it’s going to be very expensive. You’re right, we have to think from the customer’s perspective and from the developer’s perspective: how are they going to work on this? From the project management side, does the estimation make sense or not? But if the estimate is too high, sales can’t sell it either. So you have to keep everybody happy. But yeah, usually the customer doesn’t know what they want.
K: So what are the challenges that you have faced so far, being a solution architect?
T: Well, the biggest challenge is that. It’s more like balancing, because it’s really like sitting in the middle of many parties. We have to help sales, we have to really help the client reach their business goals, and we have to pour it all into a clear contract: this is the scope, this is the deliverable we have agreed on, this is a sample schema, and so on. Then when the developer is working, there may be things that we could not cover when we assessed it, because we don’t implement anything, right? We are only testing, maybe coding a little as a proof of concept. There is also a handover process so that the developer knows, first, the scope, and second, how to implement it as quickly and as effectively as possible without needing to relearn everything: we’ve assessed this, I’ve seen this, this is actually the best approach. But sometimes the developers also have their own way, there are many roads to Rome, so the approach we found and assumed during pre-sales when making the estimate may not be how it ends up being implemented. It could be faster, it could be slower. So yes, the challenge is to keep everybody happy.
K: Right it is difficult if it relates to the interests of many people. But do you enjoy your job right now? Since things are different from when you were a programmer who only worked on the technical stuff. Now maybe you have to pay attention to the needs of many people too, right.
T: Enjoy? I really enjoy it. Because apparently I might like coding, but it’s more like… there is a common thread in why I enjoy my work now too. It’s that I can reverse engineer things and figure out how things work. I used to figure out how code works, but now I figure out how humans work. So the challenges are different, and there are still technicalities. I’m never bored. Every day I have to look at different websites and use cases, even the salesperson is different. So I really enjoy it, especially with remote working and so on, right, it’s so fun.
K: Now this is interesting, we already touched on remote working for a bit. But before going there, earlier you said that you also manage the team. How do you describe your management style? I mean, as a leader, what are you like? Can you tell us?
T: You probably should ask my team. But yeah… the biggest lesson that I learned since I started managing people is that the key is only in understanding incentives. So whatever system or process that is implemented when managing, we must look at… that means any policy, whatever metrics set for the team, do they make sense or not? For example, just a concrete example, the team, for example, we say, OK, the Solutions Architect team is targeting to close as much as possible… closing as many deals as possible, to help us sell as many projects as possible. Now that’s the wrong incentive, because it could be that we just write low quotations, we’re not responsible for the success of the projects and stuff. That’s one of the most, well not memorable, it’s just like… it turns out that if I can understand the incentives that actually are the most appropriate, I can motivate… first, set context for the team, because it doesn’t matter as long as we have a very good and balanced metric, sane metrics—it’s not like selling as many as you can—we don’t need to tell them that “this is the way” step by step. Each person can think about how they can contribute to that number. So the management style focuses on intangible things, I don’t really tell them the step-by-step, but I think about where we should go, where we are, and what is blocking us to get there… and unblocking stuff. When it comes to style… I’m more comfortable on one-on-one though. Because I think every team member has… they are where they are, and nobody is at the same place. So how can I understand each of them, what their situation is like, what are their main motivators, like one of the people in my team is very motivated, he really enjoys tinkering. There is also one who is really motivated by salary for example. There are also those who are motivated by… different things, some have background like this, like that. How do you make it all balanced so that as a team we can move together? 
And I feel that advice can’t be too general, so I have to look at each person, “oh this is the approach for this person”, so my style is more personalized.
K: So it is more like communicating the vision and more about personalized approach, right? Okay. Now let’s talk about remote working. I actually have read your article about What You Need to Know Before Working Remotely. Can you summarize that article a little bit?
T: I forgot what I wrote there haha.
K: I think remote working is really hyped here, people are like “I want to have a remote job”. According to you, what are the things needed for people to work remotely?
T: Hmm. Yes, remote work is really hot right now, but from what I see, not everyone is suited to remote work. So maybe the first thing is, if you want to work remotely, you need to be disciplined. Disciplined in the sense that you must know when to work and when to stop working. Because sometimes drawing boundaries is difficult. Maybe because we all work remotely here, we know the myth that people who work remotely will be lazy or that the output would not be as good as onsite. We know it’s actually wrong, because we end up overcompensating. If we don’t produce anything in one day, we feel like, why does it feel like I’m not doing anything? So it’s about self-discipline and knowing whether you have a tendency to overwork yourself. Those two things.
K: I see. That’s from our own side. What do you think from the company side? Because actually it seems there are no local companies that allow their employees to work fully remotely yet. It seems most companies are still implementing remote work as a half measure.
T: Hmm, yes. Actually… before I wrote What You Need to Know Before Working Remotely, I had written about working remotely (Bekerja Secara Remote), and it was entirely in Indonesian. Because of that article, many people asked: my company really wants to try remote work, but how can we do it? First, what tools should I consider, and secondly, how do we look for people who can work remotely? Honestly I think Indonesian people are not used to it. First, they are not accustomed to working without supervision. Maybe there are some who are, but it’s a small population. So how can companies transition to remote work? In my opinion, we have to try. Different companies will be different… First I will ask, why do you want remote work? And second, what division do you want to enable remote work for? From there I can see… does the purpose or goal make sense or not? Is it just, oh, I want to save costs on building rent, for example? Actually that’s one of the most valid reasons. But there are also reasons that I think don’t make much sense, and with functions like HRD and finance, it’s very tricky. You can’t decide to enable remote work for engineering divisions without allowing other divisions as well. So it’s really tricky now, I don’t know.
K: Maybe because there are still many people who are still trying to figure out, how does this thing actually work?
T: Yes, I don’t think there’s a best practice, in my opinion, like “hey you just need to implement this and that”.
K: It has to be a trial-and-error process, right.
T: Yes, yes. What I often hear about is communication, right? Communication… especially if for example there are some engineering teams that can be remote and others that are not remote. It will be tricky for the ones who are remote because there is a decision-making process in which they are ultimately not involved. So if you want remote work, you have to do it all if you want to be fair. Or…
K: Or remote-first, that is. So we optimize for remote even though there are people who work onsite too. If you want to create a remote team, you can’t do it halfway. For example, if you have a meeting, you have to think about how to include people who aren’t in the office. I think that half-baked remote work makes people skeptical about remote work. This is because they are trying it out in a setup that is actually not optimized for it, I think. Do you think the term remote working is actually overrated? Because I think it has been overglorified lately.
T: It seems like it… maybe… I’ve just discussed this with my friends on Twitter. On the why… I think people are just curious about this. The neighboring grass always looks greener. People who have never worked remotely would be like “oh remote work sounds nice”, but they don’t know what goes on behind the scenes. But is it overglorified? Hmm, I don’t know either, because with the coronavirus cases and flooding and so on, this is actually a good opportunity to promote remote work. It’s actually really cool, right?
G: But yeah, because of the coronavirus cases and flooding… having the ability to work from home or to work remotely is something that is really valuable. Especially with tools like Zoom, Slack that we can use to conduct work without having to… you know, the tools that already exist make it easier if we want to switch directly to remote work so it’s very helpful.
K: Do you believe that in the future maybe more companies will allow employees to work remotely?
T: We can’t seem to avoid it. If it’s not like that, it feels like we are moving backwards. Because actually most of our work as knowledge workers, and maybe in the field of technology… even though we go to the office, we actually just spend the day in front of the computer, we talk with friends about work, and we can collaborate with digital tools anyway. So I don’t think there is… no reason it won’t happen.
G: Besides flooding and corona, it’s very helpful, let’s say, for mothers so that they can take care of their children, and they don’t want to waste too much time on the roads. Not only mothers actually, parents, fathers as well. That’s a valuable thing.
K: For me, it’s actually helpful too. Because last year, when I stayed at home, my sister already has a child and she’s a teacher, so she teaches in the morning. So in the morning I can help take care of my nephew. So that’s something that is possible because I can work remotely. If it’s not the case I might not be able to help take care of my nephew.
T: For the company it can also be an advantage for hiring. Assuming you can offer a lower salary, well, it’s not advisable, but it can be done, you can say that one of the benefits is that you can work remotely.
K: Okay. I think that’s all for our conversations with Tere. Once again, thank you very much for chatting with us! It’s so exciting to chat with you.
T: Yes, you’re welcome.