Super Easy Text and Image Crawling with Python | Building a Complete Service, Lesson 1
You can see the files are saved nicely from 1st to 40th. Save, and you can see the photo data downloaded like this.

Hello. This is JoCoding, an easy coding channel that anyone can learn from. As I announced on the community tab, starting with this video we'll work on a project that builds a complete web and app service, then launches, markets, and monetizes it. The service we're going to make is a web and app service called "AI Animal Look-alike Test." By training a machine learning model on celebrities famous for their animal look-alikes, the service will analyze a photo you upload, determine which animal you resemble, and let you share the result on your SNS.

For the first step, we'll gather photo data of celebrities with animal look-alikes. Searching for and downloading each photo by hand would take a long time, so we'll use a technique called crawling. Crawling is a technique that automatically gathers only the information we want from the Internet. Even with basic crawling knowledge, you can build many practical and fun projects. For example, if you crawl a school's or an office's meal plan, you can create a service that shows today's menu. You can also build a service that lets you pick and choose news by gathering articles on a specific topic from various news websites. It can be used to make many different services.

The principle behind crawling is very simple. All we have to do is load the site we want information from, find the information we want there, and write code that extracts it. Crawling libraries exist for every language, but let's practice with Python, which is popular these days. I'll keep the explanation easy, so even if you don't know Python, please keep watching.

Let's start with the most basic part of crawling. Let's look at the screen. In today's lesson, we'll learn about crawling text and images. First, we'll learn the principles by setting up a development environment and starting with a simple example. Then we'll apply them to crawl Naver's real-time search rankings and save them as a text file. The highlight of the video comes last: we'll crawl hundreds of photos of animal look-alike celebrities with the least amount of code, using an easy-to-use library. We'll learn useful functions along the way, so keep watching.

Today we're going to crawl using Python. To do so, we need to install Python and Python-related libraries. But installing them directly on your computer can cause version conflicts when you work on other projects, so normally you'd go through the complex process of creating a virtual environment and installing there. That would be a hassle, so we'll use goormIDE, the cloud IDE we used in the last Ruby on Rails lesson. A cloud IDE provides a virtual machine with the development environment already prepared: no separate language installation is required, and when you start another task you can create a new virtual machine and work there, so there's no risk of version conflicts. Recently I've been working as an evangelist with support from goormIDE. If you've tried it yourself, you know it takes quite a while to configure a local development environment; using a cloud-based IDE service saves a lot of time and makes it easier to focus on development. goormIDE is also a decent cloud IDE service that can be used for free.
If you prefer, you can use a different IDE or a local environment instead. This is the official goormIDE homepage. The address is ide.goorm.io, and I'll leave it in the video. After registering, press the Dashboard button here, then click the Create New Container button at the bottom. This is where we configure the environment for the container,
the virtual machine we'll use. We'll be provided a computer set up according to what we write in this form, so let's fill it out. We're going to crawl, so the name will be crawling, and the region is Seoul.
If visibility is set to Public, others can come and see it; Private means it's only for me. We'll open it as Public. You can leave the template as is, set Publish to Not used, and
choose the software stack to pre-install in the virtual environment. Since we'll use Python, check Python and press the Create button without changing anything else. Once the container is created, click the Run Container button.

Now we're in a virtual machine environment with Python installed. To briefly explain the screen: it contains a built-in file browser, like Windows Explorer, and a command input window at the bottom. If you open a file, you'll see a screen for editing code. A setup where all the screens needed for coding are in one place is called an integrated development environment, or IDE. Here, index.py is a starter file with Python code in it. Let's check that Python code runs well here. Enter 'python index.py' (python plus the file name) at the command prompt, and you'll see 'hello python' executed.

Now we can write our crawling code here. We'll use a library called Beautiful Soup. Beautiful Soup is a Python package, or Python library, for parsing HTML and XML. It's a very popular library, so there are lots of documents to refer to. Let's open its Wikipedia page. There's example code with an explanation here; let's use it as it is. Copy the whole thing, paste it into index.py, save it, and try running it. To execute, press the up arrow key to bring back the previous command and press Enter.

But here we get a ModuleNotFoundError. It couldn't load the module called bs4, right here, because Beautiful Soup 4 isn't installed. Installation is very simple: type 'pip install bs4' into the command prompt, and it will automatically download and complete the installation.

For those of you unfamiliar with the concept of installing packages: on a Windows PC, if you need to make a presentation, you download and install PowerPoint, and when you have to write a document, you download and install Word. On Windows, you install programs through these installer files and use the functions you want. Similarly, if you think of Python as the computer, there are libraries made for Python: to crawl, you install and use a crawling library, and to analyze data, you install and use pandas. To install these pre-made functions and programs in Python, all you have to do is type 'pip install' plus the package name into the command line, and pip, the Python package manager, downloads and installs it automatically. This concept isn't unique to Python; it exists in all languages. RubyGems, covered in the last Ruby on Rails video, is the same concept, and so is npm in Node.js. Just as we don't use only the basic functions of Windows but also install external programs, when writing a program in any language, if a library already exists, it's much quicker and more efficient to install and use it.

Now that the installation is complete, let's run it again. Bring back the previous command and press Enter, and now you can see a good result. Let's look at the code line by line to see how this result is obtained. The second line at the top isn't important code, and the program still works if you delete it, so I'll delete it. The next two lines load libraries. 'from bs4 import BeautifulSoup' means we're importing BeautifulSoup from the bs4 library we just installed.
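For reference, here is a sketch close to the example on the Wikipedia page at the time (the exact URL and details on the page may differ slightly):

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Open the page and parse its HTML
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')

# Print the link address (href) of every <a> tag on the page
for anchor in soup.findAll('a'):
    print(anchor.get('href', '/'))
```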
Once imported like this, the library can be used anywhere below to access the features of Beautiful Soup. The second import means we take a function called urlopen from urllib.request to use below. You don't have to memorize this; just use it.

Then comes the full-fledged code. Calling the urlopen we loaded above opens this page: it goes to this address and puts the result into a place called response. If you're new to Python, the with/as syntax may be unfamiliar. To make it a little more intuitive: since the result is stored in response anyway, writing 'response = urlopen(...)' does the same thing, so you can delete the with/as form and use that instead. I'll press Shift+Tab to remove one level of indentation. This version is a little more intuitive: it opens the URL at this address and puts it into the variable called response. The next line passes the response we just made to a function called BeautifulSoup, analyzes it with html.parser, and assigns the result to a variable called soup. The next line uses a for statement, the loop used in Python, to find all the <a> tags in this soup and put them, one at a time, into a variable named anchor. The indented statement below is repeated for each one: for each anchor, get('href') fetches the address (the href of an <a> tag is its link address) and prints it.

So let's go back to the site and see how this code works again. We're actually on the URL we opened earlier. If you open the developer tools by pressing F12, you can see the HTML code like this. If you're not familiar with the concept of HTML, please refer to the earlier HTML lesson 1 video. This code uses .findAll('a') to get the <a> tags, which are link tags, and prints out the href part. Looking at the code again: we put the URL into response, turned the response into soup using Beautiful Soup, found all the <a> tags in that soup with .findAll('a'), looped through them one by one, and printed each href. So, as you saw before, the address parts are printed all over here. Do you understand the flow now?

Now let's do something else with this. As an example, let's crawl Naver's real-time search rankings. To do this, we need to pick out only the real-time search term part from the entire Naver site and extract only its text. We have to figure out what that part looks like in the code so we can select it and bring it here. Press F12 and use Ctrl+Shift+C to inspect the search ranking elements and figure out their pattern. Inspecting like this, here's first place, here's second place, and third place is inside here, and you can see it continues like this. There are many ways to find a pattern. If you look at these search terms, they are all wrapped in <span> tags. However, if you grabbed and printed all the <span> tags, they wouldn't only be these ones: there may be <span> tags around the ranking numbers or elsewhere. So we need to narrow it down so that only this part is selected. Then, looking more closely at the search term elements, you can see the class is always "ah_k". So if we select <span class="ah_k">, it's likely they are all search terms.

So let's write this in code. Load Naver instead of Wikipedia in the URL part, and instead of findAll('a'), select the elements with class ah_k. If you're not sure which function to call on the soup, write it by referring to the official documentation. You can reach the official documentation site by searching Google for Beautiful Soup.
In the official documentation, there are various ways to select HTML elements using Beautiful Soup. One we use often is a function called .select, which works with CSS selectors, and it gives you a lot of choices in how to select. For example, soup.select followed by 'title', the name of a tag, selects by tag name. Or you can select 'p:nth-of-type(3)'. If you watched the earlier CSS lessons, you'll recognize this: it's the same syntax CSS uses to decide which elements to select and decorate. For example, here it says select("body a"), which is the CSS syntax for selecting <a> tags among the descendants of the <body> tag, and it works the same way in Beautiful Soup.

What we want is exactly this kind of CSS-style selection, so let's copy this and, instead of selecting by tag name, select by class: we choose ah_k as the class, and put span in front of it, meaning select <span class="ah_k">. And since these elements have no href, I'll delete that part before printing. If you save and run it, you can see the search rankings load nicely.

It would be better to extract only the text content. To do this, I searched how to use .get, the function we used earlier to pull out only the href part, and found a function called .get_text(). Copy it, paste it after the anchor, and save it. Now you can see that only the text appears when you run it. I'll add the rankings here to make it look nicer: set i = 1 and prepend i. Since i is a number and you can't add a number to a string, convert it to a string first and then concatenate. Then increase i by one each time through the loop, and the rankings are now written starting from 1st. Run it, and you can see the search rankings have been crawled from 1st to 40th.

In addition, instead of only showing this on the terminal screen, let's save it directly to a text file. To do that, as we always do, we don't memorize all the code; we search for the language name plus what we want to do. Here we're using Python and want to write a text file, so if you search for "python write text file," you can find the relevant documents. Open any of them, and a file writing example is down there: if you write code like this, you get this. Copy that example code and paste it at the bottom of the code you've written. Deleting the comments and reading it: it opens a new file, runs a for statement over lines of data, keeps writing each line of data to the file, and closes the file once the data has been written. To apply this as it is, put the open() call above our for statement. We're not on a C drive but in the cloud, so delete that path and create the file right under this folder. Our for statement can run as it is, and instead of print, the data we were printing can go into f.write(data) in the same way. Do the same with close(), moving it here below the loop. Save and run the code, and now a file is created instead of printing to the screen. The search rankings came out, but they're all stuck together in a lump. In that case, put "\n" here. This is a line break character, so the file will be written with a line break after each entry, which is much neater. Save and run it again, and you can see the file is saved nicely from 1st to 40th thanks to the line breaks.
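Putting all the steps together, the finished script might look roughly like this (a sketch: Naver has since retired the real-time search ranking, so the URL, the ah_k class, and the file name rankings.txt are assumptions based on the video):

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Open the Naver main page, which (at the time of the video)
# contained the real-time search ranking markup
response = urlopen('https://www.naver.com/')
soup = BeautifulSoup(response, 'html.parser')

i = 1
f = open('rankings.txt', 'w')
# Select every <span class="ah_k"> element, i.e. each search term
for anchor in soup.select('span.ah_k'):
    data = str(i) + ' ' + anchor.get_text() + '\n'  # rank, term, line break
    f.write(data)
    i = i + 1
f.close()
```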
You can crawl images in a similar way. To crawl images, we would find the pattern in the image addresses returned for a particular search, collect those addresses, and save the images just as we saved the text file. But don't you think crawling images for a particular query is something many people have already done and many others will need to do? So hasn't someone already packaged this work into a library? If you already thought so, you're thinking like a programmer. In coding, if you think, 'Oh, this must have been done already,' then someone usually has done it. There are obviously people who have that code or have even made it into a library, and finding and using these things can save you a lot of time.

If you search for the language name and the task like this, you'll see that a library called google_images_download has already been created. We learned about pip install, and using 'pip install google_images_download', this library makes downloading images incredibly easy. First, copy the pip install statement and paste it into our terminal, and the library will be downloaded and installed automatically. The installation is complete. The documentation also has examples here; I'll bring one over as it is. Right-click > New File to create a new file; I'll name it google.py. Paste the code you copied earlier, save it, and since the library is installed, let's run it. When you run google.py, it automatically starts working. After waiting a bit, the code finishes running. Going to the download folder, 20 pictures each of Beaches, Polar bears, and baloons, matching the three keywords in the arguments, were downloaded into separate folders. Open them up, and you can see the beach photos downloaded well; open Polar bears, and the polar bear photos are there; and the balloon images were downloaded correctly in the same way. Now you can simply change the keywords here and change the limit, the number of images. Coding is like this: if you use a well-built library, you can literally copy and paste it and use it right away without having to change any code.

Here is a list of representative animal look-alike celebrities, divided into five animals, that I found while googling. Let's download their photo data all at once using that code. List the celebrities' names separated by commas, like in the example. I also raised the limit from 20 to 50, so this will automatically collect 50 photos for each celebrity here. The documentation also has a page called Input Arguments, where you'll find the various options available for the library. Since we'll use Teachable Machine later, search for 'format' here: there's an option to specify a file extension, such as jpg or gif, so you can download only the extension you want. For example, if a gif is downloaded, Teachable Machine can't recognize it, so to be safe we can specify jpg so that we only receive jpg files. I'll copy 'format', put it inside the arguments, and set it to jpg. Save and run the code. It's finished. Most of the images were received, but sometimes a download fails because of an error. In that case, just run it again with only the failed celebrity's name as the keyword. You can see that a new folder has been created and the download went well.
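For reference, the library's README sample, adapted to our case with comma-separated keywords, a limit of 50, and jpg only, looks roughly like this (CelebrityA, CelebrityB, and CelebrityC are placeholders, not the actual names used in the video):

```python
from google_images_download import google_images_download

# Instantiate the downloader class
response = google_images_download.googleimagesdownload()

# Comma-separated keywords, 50 images each, jpg files only;
# the celebrity names are placeholders for the real look-alike list
arguments = {
    "keywords": "CelebrityA,CelebrityB,CelebrityC",
    "limit": 50,
    "format": "jpg",
    "print_urls": True,
}
paths = response.download(arguments)
print(paths)  # absolute paths of the downloaded images
```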
To save the downloaded pictures to your local computer, right-click the download folder > Download File > ZIP, and the entire folder will be compressed and downloaded to your local machine. The download goes smoothly. If you unzip it after downloading, you can see the photo data arrived intact. Did you enjoy the video? If it helped, please subscribe, like, and turn on notifications; it encourages video production. See you next time with more informative videos. Thank you.