# TensorFlow Full Course | Learn TensorFlow in 3 Hours | TensorFlow Tutorial For Beginners | Edureka

Hi everyone, welcome to this session on the Deep Learning with TensorFlow full course. Before we start, let's take a look at the topics for today's session. We'll start by understanding what artificial intelligence is, what machine learning is, and the different types of machine learning. Once you understand the different types of machine learning, we'll look into the limitations of machine learning and how deep learning solves them, and then we'll look into the real-life use cases of deep learning. Once you go through the use cases, we'll look into how deep learning works, how a neural network works, and the neural network components, and then finally get into what TensorFlow is. Once you understand what TensorFlow is, we'll also look into how TensorFlow works and the TensorFlow functionalities, and then look into what a perceptron is. Finally, we'll end the session with perceptron examples.

Today we have a special guest, Amit, who is going to take the session forward. So, over to you, Amit.

Hello everyone. Welcome to the course, which is Deep Learning using
TensorFlow, and my name is Amit. I would like to give a small introduction about myself. I have around 12 years of experience in the field of machine learning, data analytics and deep learning, and for the last five years I have been working extensively in the field of deep learning as an independent consultant. I'm a mathematics graduate, have done my Master's in Applied Mathematics, and have done numerous certifications in the field of deep learning. I have worked on deep learning projects involving NLP, computer vision and other structured data analysis for different industries and companies as an independent consultant.

If you have not installed TensorFlow and Python, please go ahead and install them on your systems. That's the first requirement. So let's get started. The first thing which comes
to our mind: there has been a lot of emphasis on this term, artificial intelligence. So let's first try to understand what artificial intelligence is at a very high level, and why we may need it in the first place for solving a problem.

Let's try to understand with an example. A person goes to a doctor and wants to get checked for whether he has diabetes or not. What the doctor would say is, okay, there are some tests which you need to get done, and based on the test results the doctor would have a look, and from his experience, from his studies and the previous patients he has seen, he would be able to evaluate the reports and say whether the patient has diabetes or not.

If you just take a step back and think: I said the doctor has experience. So what do we mean by experience? The doctor has learnt what the characteristics of somebody having diabetes are. What if we could provide this experience in the form of data to a machine, and let the machine take this decision of whether a person has diabetes or not? In place of the experience which the doctor gained through his studies and his practice, what we are doing is taking customers' data, with different reports and different parameters like the glucose count in blood, the weight and height, and all these parameters about a human being, and feeding it to a machine to tell it what the characteristics of a person who has diabetes are. Let's say we have 1 million customers' data. We give it to a machine and let the machine do the job from its experience, the machine's experience, which comes from the data, or historical data to be precise, and do the same task which a doctor is doing.

So if you see from this example what an artificially intelligent machine is doing there, it's learning from the historical data and trying to do the same thing which an experienced and intelligent doctor was doing. So this kind of area
or domain covers activities which human beings were doing, where we can make a machine intelligent enough to do the same task.

Why should we create these artificial intelligence based systems? There may be a lot of points, but let's discuss just a few at a very high level. First, human beings have limited computational power. We may be good at classifying things: you can look at a group photograph and easily say which one is your friend and who the others are, and you can easily listen to a language and comprehend what a person is saying. But human beings are not very good at doing a lot of mathematical computations. If you try doing a good amount of mathematical computation in your head, it will probably not be very easy. Second, it's not possible for human beings to work continuously, 24 by 7, for 30 days on end. If we can make machines do such work, they would be able to do these computations very fast. We spent a good amount of time discussing GPUs, and I also mentioned that Google is talking about the TPU, the Tensor Processing Unit, which would hugely change the entire paradigm of computation, and machines would become more and more competitive with, or even better than, human beings in some fields.

This is the formal definition, but if I loosely
translate it, it's basically that artificially intelligent machines are those machines which can do tasks which human beings can easily do: things like identifying what's written on, let's say, a license plate, or playing games. I'm sure some of you would have already heard that machines have defeated the Go champion and chess players. Now we have digital agents like Siri and others which can understand what we want them to do and can take intelligent decisions from the text or from the voice itself. So the range of applications is huge. At a very high level, these are some of the fields where deep learning has made great inroads: game playing, expert systems, self-driving cars, robotics and natural language processing. I'm fond of natural language processing. I am currently working on a chatbot in one of my projects, so this is one area of interest for me. There may be different and new areas where we are implementing it.

All of you know that every experience of human beings is getting digitized: the kind of things you buy, the kind of things you watch, what your preferences are, who you like on Facebook and who you don't, what kind of movies you like, all these things in terms of reviews are being captured online. Once your data is captured online, there are systems which can analyze this data. So we have this huge data generation, and we now also have machines available which can process it and make intelligent decisions. That's why you will see there
has been a lot of emphasis on this in the last couple of years. A lot of new things are coming in. Some of you who have been reading papers on different subjects and different architectures would know that most of these architectures are not old. It's a very dynamic field: every day, or at least on a weekly basis, you will be hearing about a new API or a new kind of architecture being developed by somebody. Most of the things we will be studying in these classes are not very old, like the convolutional neural network and the recurrent neural network, and some of their variants are as new as last year.

If you follow TensorFlow closely, they introduced a library called the Object Detection API, which TensorFlow has made available for everybody. You would be able to see for yourself how this API works. There are five different options for selecting different deep learning architectures, or convolutional neural networks, for this API, and it's been able to identify human beings and all the other ninety objects with almost 99% accuracy in some cases. Even human beings would find it difficult to see and predict what the object is, but these machines have gone even beyond human capability in terms of identifying objects.

So now we have a fair, very high-level understanding. We haven't gone into the details of artificial intelligence, but the loose understanding is that artificial intelligence is about machines making decisions, doing tasks which human beings were doing earlier, something like game playing, natural language and driving a car. Let's understand how
machine learning and deep learning are related to artificial intelligence. Artificial intelligence is the bigger domain. If you look at robotics, there is involvement of a lot of things: there are mechanical inputs for how a machine moves, there are dynamic kinds of agents, and all other things. Within the artificial intelligence domain, we have machine learning for solving business problems. You can think of it, at a very high level, as machine learning being the brain for artificial agents. We get all the data, process it, and make the machine learn whether something is one class or another class. These kinds of decisions are taken, or learned, through machine learning algorithms.

Within machine learning, deep learning is a further subset. People who come from a machine learning background know algorithms like regression, logistic regression, decision trees, random forests or support vector machines. These are the algorithms you would have been working with till now, and I'm sure these are very powerful algorithms. Some of them are really helpful in solving a lot of business problems. Deep learning is a different kind of architecture which does similar tasks. When I say similar tasks, I mean either predicting some numerical value, let's say what the sales would be next month for a company, or doing a classification task, say whether somebody will pass an exam or not, or identifying objects within an image, or translating from one language to another. Some of these problems machine learning algorithms were not able to solve till now, and deep learning algorithms are able to solve them.

When we go through the deep learning architectures, you will be able to see that machine learning algorithms like regression or logistic regression are still very much relevant in deep learning, and a lot of the building blocks of deep learning take their intuition from these algorithms. This is one of the reasons that we have included some of the basic algorithms, like regression and logistic regression, in the class, so that we can explain some of the steps: how you prepare the data to make your model learn, and what kind of steps you need to take to prepare your data. These steps are common not only to machine learning algorithms but also to deep learning architectures.

So if you look at this, let's say we have
data on some flowers and different aspects of each flower, like sepal length, sepal width, petal length and petal width. These are the features, and features can be information about a product, a customer, anything. Let's say we're talking about a customer: then the information can be what the age of the customer is, how much he earns, or how much he spends every month on the company's products. So these are some information points. Once we give it this data, your machine would be able to learn from the data itself whether a flower belongs to species A or B or C, or whichever different kind.
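To make "learning the species from the data itself" concrete, here is a minimal sketch in plain Python. The session does not name an algorithm at this point, so 1-nearest-neighbour is swapped in purely as the simplest stand-in; the measurements, the labels "A", "B", "C" and the helper name `predict_species` are all illustrative assumptions.

```python
import math

# Hypothetical training examples:
# (sepal length, sepal width, petal length, petal width) -> species label.
# Values and labels are illustrative, not taken from the session.
TRAINING_DATA = [
    ((5.1, 3.5, 1.4, 0.2), "A"),
    ((4.9, 3.0, 1.4, 0.2), "A"),
    ((7.0, 3.2, 4.7, 1.4), "B"),
    ((6.4, 3.2, 4.5, 1.5), "B"),
    ((6.3, 3.3, 6.0, 2.5), "C"),
    ((5.8, 2.7, 5.1, 1.9), "C"),
]


def predict_species(features, training_data=TRAINING_DATA):
    """Return the label of the closest known flower (1-nearest-neighbour)."""
    closest = min(training_data, key=lambda row: math.dist(row[0], features))
    return closest[1]


# A new flower the "machine" has not seen before.
new_flower = (6.0, 3.0, 4.6, 1.4)
print(predict_species(new_flower))
```

Nobody wrote an "if petal length is greater than ..." rule here: the answer comes only from the labelled examples, so swapping in different examples changes the predictions without touching the code.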

Basically, in earlier systems used for solving such problems, people used to hand-code these rules: if this feature is that much, then this is class A, and if this feature is less than that, it's class B, and so on. These were hand-coded rules. However, when we do machine learning, as I was saying, it learns from the data, and once it has learned from the data we can give it a new input. When I say new input, I mean the information regarding the different features of the flower, the same sepal length, width and so on, and because your machine has already learned, it would be able to classify which class this particular flower belongs to.

Given that your machine is now able to understand and learn from the data, we can solve multiple business problems with the help of this. Let's take a very small example, the same example which we took earlier when we started discussing artificial intelligence: whether a person has diabetes or not. I was mentioning that this kind of decision is taken by the doctor based on the reports he has got. These reports have some numbers, like the number of times a patient had a particular kind of issue, what the glucose count is, what the BMI is and what the person's age is, and based on these numbers the doctor was able to make this kind of decision. We can take the same analogy as where we were trying to predict which species a flower is: we can take information on patients, the different attributes or features of a report and the patient, and the machine would be able to learn from these data sets, and for a new customer or a new patient it would be able to classify whether the patient has diabetes or not.

So there are two sections of it. The first one is the information
about the patient and the different characteristics of his health. The columns from number of times and glucose count through to age are the information points about the patients, and the last column is the information on whether the patient has diabetes or not. In this kind of problem we have some information points which explain what the situation is, and the last column, the output, is in some kind of classes. It is a specific type of machine learning problem, but as of now the characteristic is that we have some information about the patient, and the last column is telling me whether the patient has diabetes or not.

What your machine basically learns is all those rules. In the example I was quoting, in earlier cases people used to create hand-coded rules to predict whether an event will happen or not. But in machine learning, the algorithm will learn from the historical data and see which combinations decide whether a patient has diabetes or not. These combinations would be something of this type. It is only for illustration, these are not the real numbers, but it shows that your machine learning algorithm has been able to identify these rules based on the historical data. So let's say that after learning it has created these: is the glucose count less than 99.5? If yes, then go to the next rule; if no, the person does not have diabetes. Is the glucose count greater than 66.5? If yes, the person has diabetes, and if no, then there is a further drill-down of rules.

All these combinations or rules are dynamic in nature. What I mean is that these rules would change if your data set changes. The same model can do the work of deciding whether a patient has diabetes or not, and you can take the same model and make it learn on a new data set, let's say flower species, and it would be able to learn the new rules by itself. So the intuition is that, like human beings learn from examples, your machine learning algorithms also learn from examples.

Just to frame our problem statement: we know machine learning learns by experience, from the data, from the historical events. There are three kinds of problems
which we may be interested in solving. The first one is called a supervised learning problem. A supervised learning problem basically occurs when you have some input variables and one output column. Both the examples which we have discussed till now are of this kind. One was on the flower species, where we take data on different features of a flower and then which species the flower belongs to. The last column is the dependent variable, or output variable, which we are trying to predict, and all the information variables are called input features or input variables. Input features and input variables are used interchangeably in different texts. Input and output: these are the two different sections of supervised learning.

Why is it called supervised learning? If I need to explain it, it is because we have a column to guide the algorithm on whether it's making the correct decisions or not. Let's say your model says a person has diabetes, but the actual data says the person does not have diabetes. So you have some kind of correction mechanism within your data itself, which can help your model tune itself to make better predictions. In some texts this kind of output variable is also called a teacher variable: it's guiding the algorithm to decide those rules which I just went through.

Another type of machine learning is called unsupervised learning. In unsupervised learning we have only the input features, and our objective is to identify the patterns within the data itself. Some of you who are working in the telecom domain or on marketing campaigns would be very familiar with segmentation analysis or cluster analysis, where our objective is to identify coherent groups within the larger population. We take the customers as they are, the whole population, and based on different parameters and variables we identify groups of customers or products, whichever the business problem is, which are similar in nature, so that we can either run a marketing campaign or develop new products for those specific groups.

The final one is called reinforcement learning. Reinforcement learning is a kind of learning where the agent learns from the environment. It works on reward and penalty. You can think of a self-flying helicopter. You leave it in the environment, and it will decide, based on the wind speed and other parameters in the environment, how it should fly, and the reward is that the fuel should be spent efficiently and it should spend more time in the air.

So in supervised learning, as I said,
the objective here is that we have some input features, and the input features hold information about different aspects of a given problem or customer, and we have an output variable which explains whether the event happened or not, some kind of output variable. There are two types within supervised learning: one is called regression, the second is classification. The differentiation happens only because of the type of the output column. If your output column holds numerical or continuous values, like numbers, that would be a regression kind of problem, at a very high intuition level. For example, you're working on a problem where you need to predict how much the sales of your company would be, given information on how much it is spending on marketing, how many employees are working, and which month of the year it is. If you have this information, you are going to predict how many million dollars of sales the company would be doing. Problems like this, where your output variable holds numerical values, are regression problems.

On the other hand, if the dependent variable is of a categorical nature, or holds discrete values, it signifies that it's a classification problem. Given that the values are categorical, your objective is to put the different customers or products into different classes. That's why the name is classification problem. So, as I said, there are two kinds of supervised learning problems: one is regression and the other one is classification.

So let's take a use case where we need to predict
the housing price in a particular locality. We have information about these houses on different parameters, and these parameters are things like: what is the crime rate in that area? How old is the home? What is the distance, that is, how far is it from the city? This is from the Boston Housing data. This one, if I'm not wrong, is the percentage of Black population, or some such variable; we have the description later on. And then there is the actual price of the home. All these features, from CRIM to LSTAT, are the information about the house, so these are my input features, and the output feature is the price of the house, which is in thousands of dollars. Our objective is to fit an algorithm that is able to learn from all the historical homes which were sold, based on all the features and the price each was sold for, and once the model is trained, it should be able to predict what the price would be for a given home.

Let's take an example. Let's say one of you is interested in buying a home in the Boston area, and you would like to know the ballpark figure for a two-bedroom flat of some square footage which is, let's say, 20 miles from a specific location. What should the price be? One way would be to go and talk to people and try to understand what the average price has been. Or, if you have this kind of algorithm available, it can help you understand what the price was given these features, and if you can create a regression model, it would be able to tell you, given some features of the new home which you are interested in buying, what its price should be.

So as I said, there are two sections: one is
the independent variables, which are all the information about the home, and the last variable is the dependent variable, which is the information about what the price was.

What you see here is a scatter plot between the distance from the city and the price of the home, keeping all other variables constant. We are not looking at the influence of other features; we just want to model, or find a relationship, between the distance from the city and the price. From this graph you can make out that the further the house is from the city, the lower the price would be, if we keep all other things constant. If we can identify this kind of relationship, it's called linear regression. Maybe the line is not very linear, but for a given distance from the city you would be able to predict what the ballpark figure for a house should be, if you just have this information and not all the other information which we have talked about. In a similar fashion, there would be a relationship between the price of the house and all the features which we discussed. This one is only a relationship between one of the variables, distance to the city, and the price of the home.

In a similar fashion, we would be able to find the relationship between the price and all the other features, like how old the home is, how big it is, and what the crime rate in the area is. This kind of model is called a regression model, and it's the very basic equation of a straight line, $y = a + bx$. Here $y$ is called the dependent variable, $a$ is the intercept, $b$ is called the slope and $x$ is called the independent variable. If we go deeper and try to understand what it's doing, this equation is telling me that if I already know the relationship, that is, if I know the values of $a$ and $b$ from my historical data about different homes, then given the value of $x$ I should be able to calculate $y$, which in our particular example is the price of the home.

Let me try to give an example of what slope means. Slope is the change in the dependent variable if we change $x$, the independent variable, by one unit. So if I change $x$ by 1 unit, how much change happens in $y$? That helps me understand what kind of relationship there is between $x$ and $y$. And $a$ is the value of $y$ when $x$ is 0. In the textbooks it used to be stated like that: if you put the value of $x$ equal to 0, whatever the value of $y$ is, that's the intercept. From an intuition perspective, you can think of the intercept as the value which is there even when you don't have any information about $x$. For example, we were discussing the relationship between the distance from the city and the housing price. Even if the house is exactly in the city, there would be some value, and even if the house is a hundred miles from the city, there would still be some price. So it helps us intuitively understand the relationship. It also tells the line where it starts, whether it starts from the origin or from some other place on your axis.

This kind of equation is called the equation of linear regression because, if you see here, the power of $x$ is 1, and that's why it's linear in nature. It is also the case that we are fitting the relationship between $y$ and only one of the variables.
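As a small worked illustration of the intercept and slope just described, the sketch below estimates $a$ and $b$ for $y = a + bx$ with the usual least-squares formulas and then predicts a price for a new distance. The helper name `fit_line` and the distance and price numbers are made up for illustration; they are not the Boston figures, and this is not the scikit-learn code used in the session.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    # the fitted line always passes through the point (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b


# Made-up data: distance from the city (miles) and price.
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [48.0, 46.0, 44.0, 42.0, 40.0]

a, b = fit_line(distances, prices)
predicted = a + b * 6.0  # price for a house 6 miles out
print(a, b, predicted)
```

With these numbers the slope comes out negative: each extra mile lowers the price by two units, and the intercept is the price the line assigns to a house at distance zero.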

Then there is multiple regression, where, instead of finding the relationship between only one variable and the dependent variable, we use several. In most practical scenarios the dependent variable $y$ depends on more than one feature. For example, your house price depends on all these features, all these information points available, and what you want is for your regression model to identify the relationship given all the information together and then predict the price. The equation becomes $y = a + b_1 x_1 + b_2 x_2 + b_3 x_3$ and so on, where the coefficients $b_1$, $b_2$, $b_3$ help us understand the contribution of a given variable to your regression equation.

So let's have a look at how you can fit this model in Python. If you look
at the first block of the code, where we are saying `import pandas as pd`, `import numpy as np` and `import matplotlib.pyplot as plt`, this is the convention in Python for importing some of the libraries which we will be using. These are the libraries which are required for running this regression model. Once we import them, we can call all the functions available in these libraries very easily, and we will be seeing how you can call them.

Once you have imported these libraries, you can see that we are loading the data set called Boston, and this is the same data set which we have been discussing in the use case here. So in this line of code we are importing the data and loading the Boston data set. In the next line of code we are calling the pandas library, which we imported as `pd`, and then calling a function called `DataFrame` so that we can create a data frame in Python, and we are creating it from `boston.data`. So it will be creating `bos` as a data frame. In very loose terms, you can think of a data frame as a spreadsheet kind of format, where your data is put in rows and columns. You can think of an Excel file as a kind of framework for the data frame; it is different, but just for intuition purposes.

After importing, I am calling `.head()`. What `.head()` does is give the top rows of my data set (five by default). There are all the thirteen columns. In Python the index starts from zero, and you can see that the index starts from 0, 1, 2, 3. So you can see what this data is.
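A minimal sketch of the data frame idea described above, assuming pandas is installed. Recent scikit-learn releases no longer ship the Boston loader, so the sketch builds a tiny frame by hand instead; the column names borrow three of the Boston feature names, and every value is invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the Boston data; values are invented.
rows = {
    "CRIM": [0.02, 0.03, 0.03, 0.07, 0.09, 0.14],  # crime rate
    "AGE": [65.2, 78.9, 61.1, 45.8, 54.2, 58.7],   # age of the home
    "DIS": [4.09, 4.97, 4.97, 6.06, 6.06, 6.06],   # distance from the city
}
houses = pd.DataFrame(rows)

print(houses.head())          # first five rows by default; the index starts at 0
print(houses.head(3))         # pass a number to ask for a different count
print(list(houses.columns))   # the column (feature) names
```

The frame has six rows, so `head()` leaves the last one out, which makes the default of five visible.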

Then there is this command which says `.columns`. `.columns` gives you all the features available in your data set. So these are the different feature names, the different column names for the data, and the price. At the end of the code I have actually written one line of code which can give you all the details of what the target variable is and the history of the data, where it was recorded, so you can easily look at this.

Here we are taking `boston.target` and calling it `bos.PRICE`. So we're creating a column in our data frame `bos` from `boston.target`, which is another data vector available in the Boston data set itself. And now specifically we are saying `Y` is equal to this particular variable, so `Y` will be representing our dependent variable. At this point `bos` holds all the features plus the Boston price. Then `bos.drop('PRICE', axis=1)` means that we have dropped the price variable from the overall data frame, and `axis=1` specifies that we are removing a column. In pandas, `axis=0` represents row-level operations and `axis=1` represents column-level operations. With the print statement we are just printing `X`. So this `X` is all the input features of our data set. These are all the columns which we will be using
for predicting the housing price. And how will it be working? This is not the actual model, but what actually happens is that, once we have created the model, it will be fitting a line which goes through the actual data set. It would be something like this: the price is equal to some intercept term, plus $b_1$ multiplied by crime, plus $b_2$ multiplied by another variable, ZN, plus $b_3$ multiplied by another variable, and so on and so forth. The intercept and $b_1$, $b_2$ up to $b_n$ are the coefficients which your model would be learning from your already available data. And here we are showcasing the top five values of our housing price. So `Y` is the dependent variable, and we are looking at its top five values.

This was only a very brief and very basic introduction to how you import your data and how you can see what the different columns are. It has nothing to do with machine learning; it is only for people who are new to Python, and for people who have been out of touch with Python and just want to brush up their skills.

Now, this line of code, which I'm highlighting, is using a scikit-learn module for the train-test split. What it's doing, for both `X` and `Y`, which we already have, is this: the test size we are giving is .33, so we are randomly selecting 33 percent of the data and putting it separately in the test bucket so that we can test it later on.
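The idea of randomly holding out 33 percent of the rows can be sketched in plain Python. This is not scikit-learn's implementation: `split_rows` is a hypothetical helper, `seed` plays the role that `random_state=5` plays in the scikit-learn call, and rounding the test share up is an assumption made here. The 506 records are stand-ins for the housing rows.

```python
import math
import random


def split_rows(rows, test_size=0.33, seed=5):
    """Shuffle row indices with a fixed seed, then hold out a test share (rounded up)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)     # same seed -> same shuffle every run
    n_test = math.ceil(test_size * len(rows))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


houses = list(range(506))  # stand-ins for 506 house records
train, test = split_rows(houses)
print(len(train), len(test))
```

Calling `split_rows(houses)` again returns exactly the same two lists, so any change in results between runs comes from changes to the model rather than from a different sample.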

And this `random_state=5` is there because we will be selecting randomly. If you specify a random state, then every time you run this code you will be selecting the same set of elements from your data set. That helps you make sure that, if you run the code multiple times with changes, the variation is not coming from the selection of the sample; it should be because of the different model changes you are making.

This `.shape` attribute specifies what the dimensions are. For the first one, `X_train.shape` is giving me 339 and 13, so basically it's telling me that there are 339 rows and thirteen columns in that data set. In `X_test` there are 167 rows and thirteen columns. And your `Y_train` is just 339 rows, and there's just one column, or just one vector. The number of rows in `X_train` and `Y_train` is the same, and in `X_test` and `Y_test` the number of rows is the same, because they have been selected for the same combination. So for the same houses we have selected the input features as well as the corresponding values of the output, and the same has been done for the test section.

What we are doing here,
as I was saying, is that we have imported a library called scikit-learn, and scikit-learn has different modules for different machine learning algorithms. Linear regression is one of those modules, so we can create the linear regression object from scikit-learn and call it lm. This equals sign is Python’s assignment operator, so we are assigning lm to be a linear regression model from scikit-learn.

Now look at this line, lm.fit. Basically we are saying: use the linear regression model from scikit-learn and fit it between X_train and y_train. We are telling the model to learn the coefficients for the different X values, given the y values in the training data set. What the fit function does is calculate the values of the intercept and of b1, b2, b3 and so on for all the features in your input, for a given y variable.

Once the model has been fit, we can use the same model for doing the predictions. We use the learned model, lm, with a function called predict, on X_train. Once it has learned those coefficients, the intercept and b1, b2, b3 for all the features in the input, you can use the predict function to make predictions for your training data set. You already have the actual values as y_train, so you can compare them and see how well your model is doing. We have done the same thing for the X_test data set, lm.predict(X_test). And again, the prediction

has been done. So if you see here, I have put y_test and the predicted y_test together as a data frame, and the difference looks like this. The predicted value was 37.6 and the actual value was this value; then this is the predicted value and this is the actual value. This difference between the actual value and the predicted value signifies how much error there is in your predictions. Had your model given the same prediction as the actual value, you would say there is no error: the model is a hundred percent accurate and all the predictions made by the model are absolutely bang on. But that normally doesn’t happen. You end up with predictions that are a bit off from the actual values, and we measure that difference as one of the parameters to identify how correct your model is.

There are two metrics used for this. The most basic one is called mean squared error. It is based on the difference between the actual value and the value predicted by the model. What I mean by this is: this is your predicted value, 37.6, and this is your actual value. What you do is take the difference of these two and then take the square of it.

Why do we take the square? Because of the sign. For some values the difference may be negative and for others positive, and if you sum them up the differences may cancel out to 0, and you may end up thinking that the model is doing really well. To avoid that, we take the square, so that every difference between actual and predicted becomes positive and you can sum them up to show how far your predicted values are from the actual values. Then you take the mean of it, to show the mean squared difference between the actual value and the predicted value.
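Written as a formula, the mean squared error described here looks like this. The symbols are chosen for illustration and do not come from the slides: $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of examples.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```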

It can also be used for model comparisons. Here I can show you how it works. Let’s say you have some actual values, something like this, and let’s say you fit a model and I fit a model, so there is model prediction one, and another model is predicting column two. What you can do is take the difference between the actual and the predicted values. For the first one the difference is 2, and then you take the square of it, which is 4. For 23 and 21 the difference is again 2, and the square is 4. For the third one the difference is 5, and the square is 25, and so on and so forth for all the values. You get the total, the sum of squared errors, and you divide it by the number of inputs, which is 5, and you get the mean of the squared errors. You do the same thing for the second model, and if you see, it is much less: five. So that helps you understand which model is doing a better job in terms of predicting the housing prices, or any other numerical variable.
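The comparison can be sketched in a few lines of plain Python. The numbers below are hypothetical and are not the ones shown on the slide; the function name is also just an illustrative choice.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical actual values and the predictions of two competing models.
actual = [10, 23, 30, 41, 52]
model_1 = [12, 21, 35, 38, 50]  # differences: -2, 2, -5, 3, 2
model_2 = [11, 22, 28, 42, 51]  # differences: -1, 1, 2, -1, 1

mse_1 = mean_squared_error(actual, model_1)  # (4 + 4 + 25 + 9 + 4) / 5
mse_2 = mean_squared_error(actual, model_2)  # (1 + 1 + 4 + 1 + 1) / 5

print(mse_1, mse_2)  # 9.2 1.6
print("model 2 is better" if mse_2 < mse_1 else "model 1 is better")
```

The model with the lower value is the one whose predictions sit closer to the actual values on this data.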

And there is another statistic used for identifying how good your model is, which is called mean absolute percentage error. That is basically the difference between the actual value and the predicted value in absolute terms. You sum it up for all the entries and divide by the sum of the absolute values of your actual values, and it helps you understand by what percentage, on average, your predicted values differ from the actual values. So it will be expressed as a percentage. What I’ve done here is take the absolute difference between the actual values and the predictions, sum it up, and divide by the sum of my actual values. Whatever value comes out, you can say, okay, it’s five percent. So it would be fair to say that your model is 5% off from the actual values, or that the error term in your model is, let’s say, 5% or 6%. And whatever predictions you’re making from the model, you can keep a buffer of that percentage when you share them with the team. What I mean is, let’s say your mean absolute percentage error is around 10% and it’s about the sales of a company. When you share this forecast, you say that my predictions are around 90% accurate, and the actual sales may be plus or minus 10 percent. So this can help you convey this kind of variability

in your predictions. Whether you implement the code in Python yourself or use the scikit-learn library, you can call the mean squared error function from sklearn, and it can help you calculate the MSE for a given model.

So that was a very quick introduction to linear regression. There are different applications, but one thing remains common: we are trying to predict a dependent variable whose type is numerical or continuous. Some of the applications are predicting life expectancy based on features like eating patterns, medications, disease and so on. You can predict housing prices; we have already seen the example of that. We can predict weight from different features like sex, weight, prior information about parents and so on. And you can also predict crop yield based on different parameters like rainfall. As I said, this is a very limited list of use cases. I’m sure those of you working with a sales department have to predict how much the sales will be. Those of you working with call centers need to predict the number of calls for next month. Those of you working in marketing need to predict the footfall in a given company or mall. So there are different applications of regression models, but the one thing common across all these applications is that the dependent variable we are trying to forecast is of continuous data type. So let’s get moving

to the next agenda item, logistic regression. At the time of the introduction to machine learning, we discussed that there are two kinds of supervised learning techniques: one is regression and the other is classification. The major differentiating factor between the two was that in regression we had a dependent variable of continuous values, and in classification problems we had a dependent variable of categorical type.

So let’s take an example. Here is a use case where we have some information about some customers, and the data set looks like this: we have a customer ID or user ID, the gender of the customer or user, his or her age, the estimated salary every month, which you can think of in any currency, either INR or dollars, and whether this user purchased an SUV or not. As I was saying earlier, the dependent variable here is 0 or 1, so it’s a discrete or categorical value which we need to predict, and the features which we will be using in the model are age and estimated salary. We have already discussed

linear regression, so why would we need a logistic regression kind of algorithm? It would seem a straightforward process: I take purchased as a numerical value, 0 and 1, and I take some input features like age and estimated salary. And you would be right in saying, to some extent, that this is a possible way of doing it. But there are two major problems if we follow this, and some of you can help me with what the problems may be if I try using linear regression for this kind of problem. One limitation I can think of is that here I’m looking for an output which can give me some kind of probability of how likely I am to buy a product or service. The restriction with probability is that it should be between 0 and 1: 0 signifies that there is no likelihood of the event happening, and 1 means that it’s certain that the event will happen. There is no possibility of probability values less than 0 or greater than 1. So if I’m fitting a linear model, taking the purchased column as my dependent variable, my predicted values can go beyond 1 or below zero, because linear regression has no such limitation.

So what I require is that I fit the model in a similar fashion to linear regression, with the equation I used earlier, and I want the information to come from my features in a similar fashion. But what I actually want is that this y should be mapped to values between 0 and 1. Given the limitation we have just talked about, that a probability should be between 0 and 1 and should not go beyond 1 or below 0, I need to find a way to force this y value, which comes from this equation, into values between 0 and 1. To solve this problem, there is a function called

the sigmoid activation function, which is used extensively in deep learning as well as in other places. Logistic regression comes from this activation function itself. The function looks something like this: the output value is 1 divided by (1 plus e to the power minus x). Here x is not actually one of the input features; it is any value we are giving it. I could have selected a different name for it. If you put any value into this equation, anywhere between minus infinity and plus infinity, it will map it to between 0 and 1. If you give it a highly negative value, the output will be very, very close to 0. If the input is large and positive, the output will be close to 1. If the input x becomes 0, what would be the output? One half, because any value to the power 0 is equal to 1, and 1 divided by 1 plus 1 is equal to 1 by 2, or 0.5.

So logistic regression is nothing but an extension of your linear regression, with one additional fact: you want to force your output to be between 0 and 1, and for that you use an activation function called the sigmoid activation function, which is sometimes also called the logistic activation function. So this is the intuition behind logistic regression: you take the value of your equation, from the intercept and the different coefficients for your inputs, and you map these outputs to between 0 and 1.

So we have understood that logistic regression is nothing but an extension of linear regression, with the restriction that the output is mapped to between 0 and 1. If we were only fitting the regression equation, we would have scenarios where the value goes beyond one or below 0, and to avoid that scenario we fit logistic regression with the help of the sigmoid activation function, which looks like this.
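As a small illustration, here is the sigmoid function in plain Python, evaluated at a few inputs. The function name and the sample inputs are illustrative choices, not taken from the session’s notebook.

```python
import math

def sigmoid(x):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Very negative inputs land near 0, very positive inputs near 1,
# and an input of exactly 0 gives 0.5.
for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 4))
```

This direct form is fine for moderate inputs like these; for extremely negative `x`, `math.exp(-x)` can overflow, so numerical libraries use a more careful formulation.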

And if you see, as I was saying, say this is our zero. For all the values which are positive and greater than 0, the curve flattens out towards one, and for all the values which are less than 0 it goes towards zero. At 0 the probability is 0.5, so it’s a 50% probability if your output is zero or very close to zero. So the sigmoid activation function is used for that purpose.

It can be used for multiple scenarios. The example we are taking is whether somebody will buy an SUV or not, but you may be trying to solve problems like whether somebody will say yes or no to a product or service, or whether something is true or false, or high or low, or any other pair of categories. Logistic regression can also easily be used for multi-class classification problems. Just to give you a very quick introduction to how that works: in multi-class classification it does a mapping of one class versus the rest of the classes, and then the same analogy follows for which class a particular event would be associated with. At the end of the day, for whichever class or category the probability is highest, the model will predict that it should belong to

that particular class. With regression, we had a statistic or parameter to evaluate and compare how good your model is doing: MSE, mean squared error. That was a parameter to check the difference between the actual value and the predicted value. How were we doing it? We were taking the actual value, subtracting the predicted value, taking the square of it, doing it for all the examples, and dividing by the number of examples we have. It gives you some number, and I was also saying that this number is helpful for comparing different models. Let’s say you fit a model and I fit a model, and we compare MSE for both of them. Whichever model gives a lower MSE, it is an indication that that model is probably doing a better job in terms of prediction.

In a similar fashion, we need a statistic to see how good your model is doing when it is solving a classification problem. Here there are four categories. Let’s say we have only two classes, good and bad: the actual values are good and bad, and what your model is predicting is good and bad. Take the examples which belong to the good category and which your model is also predicting as good. These examples are called true positives, because your model is making the correct prediction on positive examples.

Another category is where the actual value for those examples is bad, so they belong to the bad category, and your model is also predicting them as bad. These are called true negatives. These are correct predictions too, because whatever the actual value is, your model is predicting the same thing. However, there are two more categories. Where the actual value was bad but your model is predicting good, those examples are called false positives, because your model is falsely predicting that these are good examples. The last category is called false negatives, where the actual values were good and the model is predicting bad.

So how do you tell which model is doing a good job? What we do is calculate what percentage of the examples have been predicted correctly. These sections in blue, true positives and true negatives, are the examples which your model is able to predict correctly, and these two groups, false positives and false negatives, are the incorrect predictions. We just want to take the percentage of correct predictions, and this table of the four counts is called the confusion matrix. Let’s say there are some examples, out of which

65 are examples which were actually in the good category and which your model is also predicting as good. 24 are those which belong to the bad category and which the model is also predicting as bad. Eight are actually bad and predicted good, and three are actually good and predicted bad. What you do is sum up all the correct examples, 65 and 24, and divide by all the examples in your data set, correct and incorrect ones, and here you get 89%. So you can say that your model is able

to predict with 89% accuracy. Or, if you want to explain it to your business team, you can say: whenever I give you a prediction that a hundred customers will churn, and I give you a list of a hundred customers, I can say with some certainty that about 89 of them will churn, because the model has given you 89 percent accuracy. That’s how it is communicated to business teams. If we think our model is 99% accurate, then whatever prediction we are giving, we are very, very certain. But if your model accuracy is 70 or 60 percent, then when you give the predictions to a business team you say: we are giving you the predictions, but we are not very certain whether they will work correctly or not. So, in a similar fashion to the MSE we used for linear regression to identify how good the predictions are, the accuracy percentage is the metric to see how correct the predictions have been. So now we can see the implementation of

logistic regression in Python. In the first few lines, if you see, we are importing the libraries which we require to do the data manipulation. We are importing the data, which is in CSV format, and this data is already available on your LMS, so you can easily get it from the LMS itself. Unlike the Boston data set, which we were importing from the library itself, here we have a flat file, Social_Network_Ads.csv, and you can call the read_csv function of the pandas library to import the data. So you are importing the data as dataset. And as I said, head shows the top five rows of your data. Here we have only five columns: user ID, gender, age and salary, and the last column is our dependent variable, which signifies whether a customer or user bought the SUV or not. It’s 0 or 1, and 1 means the person bought.
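A minimal sketch of this loading step, assuming pandas is installed. To keep it self-contained, the rows below are made-up stand-ins read from an in-memory string rather than the course file, and the exact column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical rows standing in for the CSV file; column names are assumptions.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1,Male,19,19000,0
2,Male,35,20000,0
3,Female,26,43000,0
4,Female,27,57000,0
5,Male,47,25000,1
6,Female,32,150000,1
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())   # the first five rows
print(dataset.shape)    # (6, 5)
```

`head()` returns five rows by default; passing a number, as in `dataset.head(3)`, changes how many are shown.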

In the previous code, we used one convention for selecting X and y. Here we are showcasing another way of selecting, called iloc, so we are selecting by location. Let me go through what we are doing here. This is the data set, and within the data set we are specifying the locations. This colon means that we want all the rows. As I was saying earlier, in Python the index starts from 0, and we are saying we want columns 2 and 3. So this is 0, this is 1, this is 2 and this is 3, and we want 2 and 3 as our input features. And .values means it will be creating an array of these two columns.

We could have used gender, but I will leave that to you. First we would need to turn gender into a vector of zeros and ones. You can write a function which says, if gender is equal to male then one, else zero, or you can create dummy variables; there are functions available for that in scikit-learn. So it’s an exercise for you. This code is already available, but I would encourage you to also include the gender information in your model.

In the next line of code, for y, we are saying the dependent variable is all the rows and column number four. Column number four is your purchase information, whether a customer bought the SUV or not, and again .values converts it into an array. So we have specified two things: X, two columns with the information about the input features, that is, how much money the user makes and what their age is, and y, whether a customer bought the SUV or not. And in the next line we are doing

the train and test split, for the same reason: to evaluate whether the model, which learned on the training data set, is still doing the correct classification on the test data set, which was not involved at the time of training. This .25 means that we are selecting 25% of the data for test and the remaining 75% for train.

What is the correct split of train and test? Normally people choose something like 60/40 or 70/30 or 80/20. If your data set is big enough, then I think an 80/20 split is good, and you can try these different combinations. But as a rule of thumb, most of the time I have seen people taking something like a 60/40 or 70/30 distribution between train and test. Now there is one important thing

for data pre-processing, and this is the section I have highlighted. Some of you who come from a machine learning background will already know how important it is to scale your data. What do I mean by scaling? If you look at the data set we are using, one input is the age column and the second is the income column. Age can be somewhere between, let’s say, 1 and 100, or 120 at most, and your income is in the thousands or hundreds of thousands. These values are on different scales. Scaling your data so that, let’s say, all the values are between zero and one will help me understand the importance of each variable. For example, if you look at a regression equation and see those coefficients b1, b2 for all the input features, these coefficients can give you an indication of how important a particular variable is. But this intuition will only be correct if all my features are on the same scale. When features like age and salary are on different scales, you will not be able to compare what these coefficients really mean, because they come from two different scales. So it is always a good idea to have all your features on the same scale. There are multiple ways of doing it, and there are multiple types of scaling. The simplest one is called

min-max scaling. What does it mean? Take a given column, let’s say the age column, which is 19, 35, 26 and so on, and say I have only these five values. If I need to do min-max scaling, I take a value, let’s say 19, and the minimum value here is, let’s say, 19. The formula is: x_i minus the minimum value of the column, so the minimum of age, divided by the maximum of age minus the minimum of age. If I apply this formula to all the values in the age column, it will convert all the values to between 0 and 1. And there are other ways. There is also the standardization process, where you take the value minus the average value, divided by the standard deviation, if I recall it correctly. Whichever method we apply, all I’m saying is that these values of age and salary should be brought to the same scale. So if I’m applying this min-max scaling, I’ll apply it to both my columns, so that both variables are on the same scale and I can use them in my model.

And this is again a very important thing: whenever you do scaling, you follow this process. You fit the normalization, or the standard scaler, on the training data set, and you use the same learned scaling on the test data set.
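The fit-on-train, apply-to-test pattern can be sketched in plain Python for the min-max case. The helper names and the age values are hypothetical, chosen only for illustration.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training column only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale with (x - min) / (max - min), using the learned min and max."""
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical age columns for the training and test sets.
age_train = [19, 35, 26, 27, 47]
age_test = [19, 30, 50]

lo, hi = fit_min_max(age_train)                  # 19 and 47, from training data only
train_scaled = apply_min_max(age_train, lo, hi)  # every value falls in [0, 1]
test_scaled = apply_min_max(age_test, lo, hi)    # same scale as the training set

print(train_scaled)
print(test_scaled)
```

Because the test column reuses the training minimum and maximum, an age of 19 maps to 0.0 in both sets, while a test value above the training maximum (50 here) comes out slightly above 1. scikit-learn’s `MinMaxScaler` and `StandardScaler` follow the same idea through `fit` on the training data and `transform` on the test data.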

But basically, how it helps is this. On the data set on which your model is trained, it converts the values to between zero and one based on the minimum and maximum values. If the test data set has different minimum and maximum values and you fit the scaling again on it, the same number can end up with a different scaled value. That’s why the process is that we fit our scaling on the training data set and use the same minimum and maximum values for scaling the test set as well. It gives the same scale for all the values, and for model predictions that’s very helpful. In a similar fashion, just as

we called a linear regression object from scikit-learn in the previous example, in exactly the same way we can call a logistic regression class from scikit-learn. It’s in scikit-learn’s linear_model module, and we are importing LogisticRegression. Now we are fitting the logistic regression between X_train and y_train, the same way we did earlier for linear regression. Once it has been fit, we can do the prediction for the test data set. We are calling predict on the object, which we named classifier for the logistic regression, on the X_test data set. Here the default probability threshold is 0.5. So what your algorithm is doing in the back end is that, for all the examples in X_test where the probability came out greater than 0.5, it tagged them as one, and for all the examples where the probability was less than or equal to 0.5, it tagged them as 0.

Now we are calling this function called confusion_matrix between y_test and y_pred, so we are comparing the values: true positives, true negatives, false positives and false negatives. These are the true positives, these are the true negatives, and these are the misclassified values. If you want to calculate the accuracy, you can easily do it: 65 plus 24, divided by 65 plus 24 plus 8 plus 3. All we are doing is trying to identify the percentage of correctly predicted examples. And this is the code section,
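The accuracy arithmetic quoted above can be checked with a short sketch. The function name is an illustrative choice, and which of the two error counts (8 and 3) is the false-positive count is an assumption here; it does not change the result.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all examples that were predicted correctly."""
    correct = tp + tn
    total = tp + tn + fp + fn
    return correct / total

# Counts quoted in the walkthrough: 65 and 24 correct, 8 and 3 incorrect.
# Assigning 8 to fp and 3 to fn is an assumption for illustration.
acc = accuracy(tp=65, tn=24, fp=8, fn=3)
print(f"{acc:.0%}")  # 89%
```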

so you can see the prediction. If you see what it has done, your regression model has fit this line, and you can see that it is a straight line. That is why in some texts logistic regression is also called a linear classifier. It is called a linear classifier because it is predominantly built by fitting a linear equation. The logistic regression equation was y equal to a plus b1 x1 plus b2 x2 and so on, with all these coefficients and their respective inputs, and the highest power of your inputs was one. You would already know that a polynomial of power one stands for a straight line, so that’s why you see a straight line. Some of you would argue that we can fit a nonlinear boundary with the help of logistic regression, but you would also concede that those are tricks we use for creating nonlinear boundaries through logistic regression; for example, you introduce higher-order polynomial terms into your model so that the separation becomes nonlinear. These kinds of algorithms are really helpful only for problems where the classes are linearly separable. When the classes are not linearly separable, these algorithms are not very helpful, and we need to identify algorithms which can fit nonlinear hypotheses or separation boundaries between different classes. So let’s take a use

case to understand the simple scenarios where unsupervised learning can be used, and how it really works. Let’s take an example where we have some housing data, in terms of where the homes are located. These white dots on the blue background show where the homes are. The objective of an education officer is to find a few locations where schools can be set up, and the constraint is that students should not have to travel much. With this constraint in mind, the officer needs to decide the locations.

Maybe we can identify them easily without using any algorithm. Let’s say I am the officer, I need to open three schools in the locality, and I know where the homes are located. I can easily see that, okay, probably this is one location; I’m just highlighting it. And the constraint I mentioned is that students should not have to travel much. What I mean is that if you open the school here, every one of you would say that it’s not a great location for a school, given that it’s far away from the population. So this is not the correct location. From looking at the homes, probably these three are. With human intervention, if some of you were given the task without any algorithm, you could decide that these are probably the three locations where, if you set up the schools, most of the students would travel less to go to school.

Given this problem, we can easily see that we don’t have a dependent variable as such, telling us whether a location is correct or not. All we have is the number of locations we need to find for schools, and the home locations, and based on the distance from each home we need to identify the proper locations for these schools. And another thing that is coming

from the same logic is that there are no predefined classes for these locations. One more point I would like to add, for those of you who have done clustering or segmentation in your work: this number, whether we say three or four or five, is not predefined. Most of the time it is given by the business, how many clusters or segments they are looking for. There are statistical ways of identifying what the best number of clusters should be, but most of the time it comes from somebody in the business.

Let’s say, I was working for one of the Indian telecom companies quite a while back, and at that time their subscriber base was around 300 million customers. Imagine you’re trying to create segments for this big a population. If you create three or four clusters, you can easily understand that it would be very difficult for a marketing team or a product team to design products for such big groups. So though statistically it may look like there are four or five unique segments, you end up creating a lot of small segments, and there is a possibility that you will be creating 20 or 30 segments for such a big population. So my point is that this number, the three locations we are trying to identify within the population, has to be decided either by the business or by people like you, who have knowledge about the data as well as about what kind of business they are running and what the final use of this segmentation exercise is. So, let’s see: one way of doing the selection

of these School locations is like we have already doing it. If you are in defy that somebody looked

at the homes and Casey the density is where the density is high and

selected the home automatically, but there are algorithms

also available to do the task and I can give the name here itself its key

K means algorithm. And so first we

would like to understand that how does an algorithm work if it needs to identify

which is the best location. So if you are looking

at my screen, let’s Our objective is that we need to identify

two locations first and we have some data

and scatter plot available. And we need to identify

where the school should be so that the distance

from home should be minimum if that’s our objective. So how we can do it that lets say we

randomly assign two points from the existing data set;

actually, the easiest way is that you randomly

pick two points from your data set itself. And then what you do is you

treat these two selected points as

your cluster centroids. So each one is the center

of its selected population. Then comes the second step. Once you have initialized

these two random points, the next step is that you measure the distance

of all the homes from the initially selected points. So let’s say you measure

the distance of this home from the first selected point

and again from the second one. For each house

and each randomly initialized point, we

measure the distance between the initialized point and the home location and see

which distance is minimum, that is, which distance is less

in comparison to the other.

So we can easily see that this distance is smaller

than this distance, and this point would be assigned

to this particular group. So the first step, initialization, is to initialize as

many centroids as the number of clusters you need, and in the second step you

do the cluster assignment. In cluster assignment,

you measure the distance from these initialized

points, see wherever the distance is minimum, and then assign this home to that particular

segment or cluster. So this exercise

is done for each home; I’m just trying to show it

for a couple of them. Based on the distance, the assignment is complete. The color also signifies what we have done:

after measuring the distance for each home from the initial points, we

have assigned these points to the first cluster and these blue points

to the second cluster. And then once this

assignment is complete, the algorithm moves the centroids. What

it does is take the center (the mean) of all the points assigned to a cluster and then

move the centroid from the previous point to that new point, based

on the assignment which has just

been completed. Then

the same exercise which was done earlier

for cluster assignment is repeated: we measure the distance of each home from the centroids,

so the distance from this centroid and this centroid, and wherever it’s minimum, assign

the home to that particular cluster. This process

is repeated for both the centroids, and once the distance

has been measured from the improved

or changed centroids, the assignment process

starts again. So once you have moved the centroids, you measure the distance

and the assignment also changes, like it did

in the previous step. We continue this process

until we have reached a point where the changes in assignment

have stopped completely. So once we have reached

this kind of scenario, where however many times you measure

the distance from the centroids to the different points, your centroids do

not change, we say that your model has converged,

and at that point you can say, okay, there

is one group of these points or these homes, so this is one cluster,

and the second one is this cluster. So this is how k-means works. It has a wide

variety of applications. There is an implementation available

in the scikit-learn library; you can try implementing

it. The intent of showing you this example

of unsupervised learning was that we will be covering

two algorithms which come from the unsupervised learning

section of machine learning. These are restricted Boltzmann

machines and autoencoders, which work on a similar

methodology of unsupervised learning. So, as we started discussing

in the beginning, where should

these schools be located? We can use the k-means

algorithm: initialize three points randomly, measure the distance

to each home, assign the homes to the cluster where

the distance is minimum, and continue this process

of measuring the distance and assigning homes to clusters until the assignments

have converged. The most important task
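
The k-means steps described above (initialize, assign, move the centroids, repeat until nothing changes) can be sketched in a few lines of NumPy. This is only a toy illustration, not the scikit-learn implementation: the `k_means` function, the `homes` coordinates and the fixed random seed are all made up for the example.

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Toy k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1 (initialization): pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Step 2 (assignment): distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Converged: no point changed its cluster since the last pass.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3 (update): move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Hypothetical home locations: two obvious groups.
homes = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = k_means(homes, k=2)
print(labels)
print(centroids)
```

On this small data set the first three homes end up in one cluster and the last three in the other, and each centroid sits at the mean of its group.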

for any data scientist is not to remember

which library is required or what the code is. In my understanding,

the most important thing which a data scientist

should remember is that once you’ve been given

a business problem, first you should be able to understand what kind

of problem it is: whether it’s a problem

of supervised learning or unsupervised learning and, given it’s a supervised

learning problem, whether it falls

into the regression type or the logistic regression (classification) type. If you can make

these decisions, then for implementing the algorithm

you will find a lot of help. In fact, scikit-learn

has starter code for almost every algorithm. So you don’t have to remember

lines of code and algorithms. All you should be able to do is, once a problem

has been given to you, identify

what kind of problem it is. Most of the time

in unsupervised learning, and specifically in

k-means kind of models, we use this elbow method

as an indicator to help you understand

what is probably a good number of clusters to start with for

the final implementation of your model. So let me give you an intuition of how it works. SSE stands

for sum of squared errors. What it means is, if I go back a little, suppose you have identified

these two clusters. The sum of squared errors would be that you take the centroid and measure the distance

to the points which are associated

with this cluster. So you measure the distance for

each point in the orange group, square

all the distances, and sum them. The same exercise is done

for the blue points, and whatever total comes out

after doing this exercise is the total sum

of squared errors. If you have two clusters, you would have some number,

and just for intuition, I’m saying this total sum

of squares comes to a hundred. That’s only

for intuition; I’m taking this number as an example

to help you. Let’s say there was

one more cluster somebody identified here, and all

these three points belong to it. They are blue in color, but I’m saying all

these three points belong to this particular segment, and the rest of these points

remain the same as previously.

As we saw, with two clusters our sum of squares

came to a hundred. When we have three, you can see that

these points are a bit far off from this particular cluster. So if I do it

with three clusters, this distance would be

a bit less, given that now I have a centroid which is closer to these points,

and whatever error or distance these three points

were adding would be a bit less given the cluster is

here. Let’s say the total goes down to 95; I’m just making up some numbers. So what

it is telling me is that the sum of squares is going down and probably I

am finding clusters which are closer

to the actual data points. And as you would know, if I keep increasing

the number of clusters in the population, this distance will keep

going down, and it can go down to 0. At what point

can this distance go down to 0? When every point becomes

a cluster in itself. So let’s say I have

20 data points there and I assign every point as

a cluster in itself. Then measuring

the distance from each point to its own centroid gives 0,

and the overall SSE becomes 0. So it may start

from a very high number, but it will be reducing with

each cluster you add. So this line, which is sometimes

called the elbow curve, is actually showing you that when you had one cluster,

this was the distance. So if you had only one cluster

anywhere in the population and you took the sum

of the squared distances, this was the value.

When you had two clusters, this was the value; when you had three,

this was the value; when you had four, this was the value. But when you had five, the sum

of squared errors did not reduce much. If you see, the drop is

very small, and after that, even though you keep on

adding clusters, the sum of squared errors

is hardly going down. So as I was saying, this

method is an indicative method,

and it gives you an intuition: if I have done my cluster analysis with

different numbers of clusters, and I’m measuring the sum

of squared errors for each number

of clusters, and I see that after four the sum of squared errors

is hardly going down, it gives me an indication that probably I

have found clusters which are more or less coherent and the population

is not very far away from the centroids.

From that point, you can make an assumption that probably four clusters

is a good choice for my given population. But as I also mentioned, it is just

an indicative process. It’s a good starting point, but you need to see how the distribution

of your clusters looks, and whether they solve

the business problem you’re trying to solve or not, and if not, whether you need

to further divide the clusters which your initial

model has identified. And here we will be taking
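
To make the sum-of-squared-errors idea concrete, here is a small sketch in plain Python. The one-dimensional data and the groupings for each number of clusters are made up and chosen by hand (no clustering algorithm is run); the point is only to show the SSE falling as clusters are added and reaching zero when every point is its own cluster.

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster's mean (1-D points)."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

# Hypothetical 1-D data with three natural groups; groupings are hand-picked.
groupings = {
    1: [[1, 2, 3, 11, 12, 13, 21, 22, 23]],
    2: [[1, 2, 3, 11, 12, 13], [21, 22, 23]],
    3: [[1, 2, 3], [11, 12, 13], [21, 22, 23]],
    4: [[1, 2], [3], [11, 12, 13], [21, 22, 23]],
    9: [[x] for x in [1, 2, 3, 11, 12, 13, 21, 22, 23]],
}

sse_by_k = {k: sse(clusters) for k, clusters in groupings.items()}
for k, value in sse_by_k.items():
    print(k, value)
```

Here the SSE drops sharply up to three clusters and barely changes after that, which is the "elbow" you would read off the plot; with nine clusters for nine points it is zero.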

a very quick introduction to a third type of learning, which is called reinforcement

learning. So far we have seen

two learning types: supervised learning

and unsupervised learning. In the first one, we are trying to predict

some dependent variable. In the second one, we are trying to identify

some kind of structure in the data set, or,

to put it in other words, we are trying to identify

some kind of coherent groups in the population. The third one

is reinforcement learning, and it’s basically that an object or a system learns

from the environment, and there is no right

or wrong answer given to the system explicitly

in the beginning itself, as in the case

of supervised learning. Here, the object moves in the environment.

An example is self-flying helicopters, which fly on their own,

sense things like the wind speed and

the pressure around them, and correct

their course accordingly. The objective

they need to achieve is to fly

for a longer period of time. So here we are given an example. Let’s say we have a robotic dog and somebody needs to train

it to take correct decisions. Correct things would be that it is walking on the path

where people are meant to walk, it’s not going

off the path, and if some task has been given,

it is doing it correctly. There are two components

of reinforcement learning, which are called

reward and penalty. If the object or the system does

the correct thing, it receives some reward, in

mathematical terms. Obviously, we will be providing

everything as numbers. And if it does the wrong thing,

it receives a penalty, and on this basis it

will keep on taking its decisions. So take the dog:

if it’s working correctly, it receives points. Say a ball is thrown:

if the robotic dog goes and picks it up, it’s a reward point. If it doesn’t do

the correct thing, it receives a penalty. Almost all these

reinforcement learning agents work in a similar fashion. For those of you who are interested

in implementing it, there is an algorithm called the

deep Q-learning algorithm, where you can design

your own system and assign the rewards

and penalties. In a similar fashion, a reinforcement learning agent is always

interacting with its environment, as I mentioned. A self-driving car is

also one of the examples: it would be receiving rewards

and penalties based on whether it’s staying on the track, taking

the right turns, moving at the correct speed, and maintaining

distance from other cars on the road. So reinforcement learning
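
The reward-and-penalty idea can be sketched with a very small tabular Q-learning loop. This is a simplified stand-in for the deep Q-learning mentioned above (a lookup table instead of a neural network), and the environment is invented for the example: an agent on a short path gets +1 for reaching the far end and -1 for stepping off the near end. The constants `ALPHA` and `GAMMA` are arbitrary choices.

```python
import random

N_STATES = 5             # positions 0..4; 0 is "off the path", 4 is the goal
ACTIONS = (-1, +1)       # step left or step right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (arbitrary choices)

def step(state, action):
    """Return (next_state, reward, done) for the toy environment."""
    nxt = state + action
    if nxt == N_STATES - 1:
        return nxt, 1.0, True    # reward: reached the goal
    if nxt == 0:
        return nxt, -1.0, True   # penalty: walked off the path
    return nxt, 0.0, False

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(2000):
    state = random.choice([1, 2, 3])     # start somewhere on the path
    done = False
    while not done:
        action = random.choice(ACTIONS)  # explore with random moves
        nxt, reward, done = step(state, action)
        best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
        # Q-learning update: move the estimate toward reward + discounted future value.
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = nxt

# Greedy policy: in each on-path state, pick the action with the higher learned value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

After training, the learned values favor stepping right (toward the reward) in every on-path position, even though the agent was never told which action is correct.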

has a huge application in

self-driving cars, or at least in some of their components. Not all: some of the components in a self-driving car

are supervised learning, for instance where the car

needs to understand what the objects in front of it are, and all other

object identification. So what are the real limitations

of machine learning? Given that we already have all three types of algorithms,

supervised, unsupervised, and reinforcement

learning algorithms, available, why

do we want a new architecture or new type of algorithm for

artificial intelligence systems? First and foremost

is the dimensions. When I say dimension here, I mean the type of data

we get from a lot of sources. Let’s say we receive images,

which are grid-like data: where the pixels are and what the intensity

of each pixel is. Or take natural language processing:

language data comes in different lengths, and the task is also

different. Suppose we need to design

a machine learning algorithm which can do

language translation. If you conceptualize this

idea of language translation from a machine learning

algorithm perspective, your input becomes a sequence of words and your output

is also a sequence of words. Those of you who are working

with machine learning algorithms, try thinking about whether we have any algorithm

currently available, like logistic regression

or decision trees, which can even be fitted to

the problem, leaving aside how good the accuracy would be.

These problems come from a different type

of data source, and we are trying to solve

a different kind of problem, like language translation

or a chatbot kind of problem, where you give a sequence

and it returns a sequence. This kind of architecture is not readily available

in classical machine learning. So that is one of the reasons: we need to

identify algorithms which can deal with such

data sets, like images and language, and which can also fit

different kinds of models, which are not only

for predicting or classifying but also give you outputs like the sequences I took

as an example. So that is the first reason we

are looking for a different type of architecture for solving such

problems. The second problem, which machine learning

algorithms are not very good at dealing with, is

the dimensionality. We would have seen data sets with a size of, let’s say,

a thousand variables, say 100,000 rows

and a thousand columns. Probably you can still fit some of the machine learning

algorithms on top of it. But consider the kind of problems we are dealing with, like

images. Let’s say every image is 200 by 200, meaning

200 pixels by 200 pixels, and it’s a colored image, which means there are

three channels. We will be discussing this in detail, but basically all

I’m trying to say is that a simple image of 200

by 200 pixels will be giving you, if you do the math,

200 multiplied by 200 values per channel. So let’s say this is your image

and it’s 200 by 200, because every image is

a kind of matrix. If it’s a colored image,

it is represented in the system through three channels,

red, green and blue, so there would be

three such grids, one on top of the other. So the

number of values you need to represent your image in the system or in your

algorithm would be 200 by 200 by 3. Then you calculate

how many features that would be. If my math is correct, it should be

120,000 features. So even a simple image
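
The arithmetic above can be checked with a short NumPy sketch. The all-zero array is just a placeholder standing in for a real 200 by 200 color image.

```python
import numpy as np

height, width, channels = 200, 200, 3                    # RGB image dimensions from the example
image = np.zeros((height, width, channels), dtype=np.uint8)  # placeholder pixel values

# Flattening the three stacked grids gives one long feature vector per image.
features = image.reshape(-1)
print(image.shape, features.shape)
```

Each image therefore contributes 200 × 200 × 3 = 120,000 input values before any feature engineering is done.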

of such small dimensions ends up giving you

120,000 features. Plus, if you are really working

on a complex problem, let’s say object

identification in images, there may be five or six objects which you need to identify,

and you’re dealing with, let’s say, 100,000

images. Then your scale of data becomes too huge for a classical machine learning

algorithm to handle easily, and your machine learning

algorithms fail to get any interesting

results out of it. So coming to the solution part:

we need an architecture which can not only read

such data, like images, but is also capable of dealing with such

huge dimensions of data. This is the second benefit which comes from

deep learning algorithms, and we will be discussing

how they handle such high-dimensional data when we go and talk

about different architectures. The third and the

most important reason that we will be looking

for a different kind of model structure

or a different kind of algorithm is for identifying the features. So let me spend a couple

of minutes on this idea. What do we mean

by identifying features? So in machine learning

algorithms, we as data scientists spend a lot of time curating the important

features. First, you’ll be scaling the features, and after scaling

you’ll be creating the interaction variables. Then, if the separator

is not very clear, you need to introduce

higher-dimensional features. Let’s say these are your data points,

and if you see that the line you’re fitting is

not separating them clearly, then some of you would be trying

higher-order polynomials of your input features. All such things are not only

difficult to come up with; there is also a lot

of trial and error about which kind

of transformation and which kind

of variable creation, that is, what kind of variable, will really work

for the classification problem. That’s the first thing. Second, if you’re working

with higher-order polynomials, what is the correct

order of polynomial you need to create? Just

to give you the scale of it, let’s say you are dealing

with only a hundred features and you need to create

second-order polynomial terms with the interactions of

all these hundred features. Then you will end up getting

around 5,000 features from the second-order polynomial alone. If you want

third-order polynomial terms, like cubed variables or the interaction

of three variables together, then these hundred

variables will become around 170,000 features. So this creation of features is

very, very difficult. Now take images:

if you look at the image which is in front of you, suppose we go and start creating

these features on our own, and our objective is, let’s say, to identify a

television in the image, which I’m highlighting here, and we have some pictures where we

need to identify it. You could create

those features manually, and some of you who have been working in the field

of computer vision for quite some time

would know that earlier we used to use features

like SIFT, SURF, Haar features and HOG features. But these are a kind

of static feature for a given object,

and we may argue that this television is

in this position in this picture, but in another picture

it can be somewhere else. So the feature which I’m identifying has

to be spatially invariant, so that the object can be

anywhere in the image. The same goes for language: a model dealing

with language data should not only be able to understand

the meaning of a word or how the word

fits into the sentence, but should also

be able to understand the context

of each word. Word embeddings learned by neural networks help

you understand which words are related to a given word, and from that

you make predictions. So these are the broad problems

of machine learning algorithms. One is that they

are not able to deal with different types

of data like images and natural language. Second

is the dimensionality problem, when the dimensionality goes to

something like 100,000 features. And third is creating

features on their own. So these are the three basic

reasons that some or all of you would be interested in going to one of the deep learning

architectures for solving such problems. And fourth, if I may add, deep

learning architectures, given that we are putting a lot

of computational power into them, often end up giving you better

accuracy for both classification and regression problems. So that’s

the fourth benefit. And how does it really

work? There are different stages in a deep learning model,

and the reason they are called deep is that it’s not just

input and output, like we have seen in regression, where you have y and x and some kind

of linear equation. Here we have different intermediate

layers, like the ones you are seeing on the screen. There are a lot

of intermediate layers for doing such complex calculations, so that all the features

which I mentioned, suppose you need

to identify a television, get calculated at different stages, one

after the other, and at the final stage you have very

refined features. This is not only for images. We are taking

image classification as the example, but for any problem we

are trying to solve, through the multiple stages your model would be able to

learn these intelligent features which are really important

for your classification or regression or

any such problem which we are trying to solve. So these were a few benefits

of deep learning, and these are actually

the broad reasons that somebody would be interested in learning

deep learning algorithms. If you can see the screen now, the application

of deep learning has spread into almost all areas:

images, language, or even the structured data which we have been working on with

machine learning algorithms. There is now a huge application

of neural networks in dealing with structured data as well,

for predicting things like churn or who will be buying

the product or not. Even these kinds of problems

are moving towards deep learning. So though you would have seen

a lot of applications from images or language only, the application of deep learning is now

happening in almost every field, with different architectures being implemented

for different problems. If you see, some of the applications

are already here, like automatic machine

translation and object classification in photographs. Then there is

an API introduced by TensorFlow

just a couple of months back, which is called

the Object Detection API.

It’s a very strong API. I have used it already, and it can help you classify

almost 90 objects in an image. It’s so powerful that its accuracy is sometimes almost

99 percent, and at times it even matches human

visual ability. We have handwriting generation. So there are new kinds

of architectures, which are not in our scope: there is a new kind

of algorithm called generative adversarial networks, or GANs. These networks

are really powerful in generating new data.

What I mean is that you give some images, let’s say you have

a thousand images, and you want to generate fake images

from this data which look similar

to the images you already have. It means

you can generate your own data from the existing data set. So we have a separate algorithm for that.

Image captioning is there too: you can combine

different models together for generating text

from the image describing what’s happening in it. You may have seen

image captioning. Then there is game playing, and we have already

seen self-driving cars. There are

a lot of other applications too. This is just a small

list of applications. One small example is Google Lens;

you can try installing it and see how it works. It can read text

from the image itself, and it can identify

different objects. So basically there is a

huge application of deep learning models and algorithms in different

spheres, on different kinds of data and different kinds

of business problems. Some of them are here.

Now, what are these tensors? Tensors are nothing

but arrays of numbers. They are just

multi-dimensional arrays where the numbers

are represented in the form of a matrix, as we call it: for example,

a 3-by-3 or a 6-by-4 matrix. We

can see this is just a vector, and the rank represents

how many dimensions there are. So if you look at it, the vector

is just a one-dimensional tensor. Here there are two dimensions,

this is one and this is the second, so it’s two-dimensional. And

this is a three-dimensional array, so there are three dimensions

here. People who are familiar

with multi-dimensional arrays should be fine. If not, probably this

example can help you visualize what we are talking about. Take the number seven, which is just a scalar. In introductory matrix

computations you would have heard that a single number

is called a scalar, and here its rank is 0;

it’s just a number. A vector of any length would

be called rank 1; this is a one-dimensional array. This is a two-dimensional array, and most data

comes in a two-dimensional array shape. Just to give

you an intuition, you can think of these as different

columns, age, gender, income, city and all those different values. So you can think

of these as columns and these as rows. So whatever data

we have been dealing with in structured form, you can think of it as

this kind of table. So this is

two-dimensional data, or we can also call it

data of rank 2. Image data also comes in this kind

of format. An image is either two-

dimensional or three-dimensional. Basically, if you

look at an image, there are pixel values in it, and based

on those pixel values the image comes up on the screen. So an image is also

a two-dimensional array if it’s a black-and-white image. However, if it’s a colored image, and I

mentioned this just some time back, there are three channels, or three colors, which make

different colors based on the combination

of the three channels. So basically a color

image looks like this: there would be

three arrays of numbers for three different colors, RGB, that is red, green and blue, and there would be

a pixel intensity in each one of those, something like what

is shown here. Based on the pixel intensities

of all three stacked on top of each other, different colors

come out. So the combination

of these three values, one per channel,

generates different colors on the screen. If you have this kind of data, it’s called rank 3 because there are three

dimensions. Your data can have more than three dimensions, but it will be

very difficult to visualize. So this is a basic introduction to

what these tensors are and what we mean

by the rank and dimensions of a tensor. So let’s get started. One takeaway

from the points we have discussed

in the last couple of slides is that data has to be converted to, or represented

in terms of, tensors before we do

any kind of calculations or mathematical computations

in TensorFlow. Now, shape. Again, we have seen that if it’s only a number,

its dimension is 0. For the shape of a vector, there are

functions, and we have already seen them

in the previous example. If you recall, when we

were running the regression and logistic regression, I used the shape attribute,

x.shape and y.shape. When I ran it for x, it gave

me the number of rows and columns; when I ran it for y, it just gave

me the number of rows. So in a similar fashion, this

is just a one-dimensional array and there are

five elements in it, so the shape is

5. For a two-dimensional array, the matrix has a shape

of the number of rows and columns, so this is the shape of this one. And for

a three-dimensional array it will be rows, columns and the depth

of the multi-dimensional array, so it will be

five, four and three. That’s how you

need to read it. So for the three-dimensional one:

five in this direction, four in this direction,

and three in the depth. So every data point
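
Rank and shape are easy to inspect with NumPy arrays, which follow the same convention as the tensors described above. This is a small sketch with made-up values; the (5, 4) shape for the matrix is an assumption chosen to line up with the (5, 4, 3) example.

```python
import numpy as np

scalar = np.array(7)                # a single number
vector = np.array([1, 2, 3, 4, 5])  # five elements
matrix = np.zeros((5, 4))           # 5 rows, 4 columns
cube = np.zeros((5, 4, 3))          # 5 rows, 4 columns, depth 3

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # ndim is the rank (number of dimensions); shape is the size along each one.
    print(name, "rank:", t.ndim, "shape:", t.shape)
```

The rank is simply the length of the shape tuple: an empty shape for the scalar, one entry for the vector, two for the matrix and three for the stacked array.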

has a shape. The reason it is really important, and the reason I’m spending

a good amount of time on this, is that even today, most of the time we end up making

a lot of mistakes in defining the dimensions

of the data. You will see that when you really

design an architecture for deep learning, you will

be specifying a lot of things, like what the dimension

of your inputs is, what the dimension of the output is, what the dimensions

of the different weights are, and there would be

a huge number of weights. Actually, this is the one

big difficulty in TensorFlow or any other deep learning library: you still need to define

the weight dimensions on your own, and most of the time problems

come from this part itself. When you go and design your own

deep learning architecture, you need to be really cautious

about the dimensions. Now, the

TensorFlow library: TensorFlow is

an open-source library, at least for now,

from Google. It started from the Google Brain project, and now it’s available

to all of us free of cost. There are two parts

to TensorFlow. One is the tensor, which we have already discussed: we need to convert

all our data into a numerical representation. The

second is flow. People who come from a programming

background would already know what lazy evaluation is,

but for those who are new, here is what we mean at an intuitive level.

We give our inputs; I’ll give you

some names here. X is our input

features, W is the weights associated with them, and MatMul

means matrix multiplication. So you take X and W, do the

matrix multiplication, and add the bias term. So we are taking

the bias and adding it, and then we are applying one of the activation functions:

to whatever output comes from this, you apply the activation

function called ReLU. The way it runs in TensorFlow

is that you will not be getting

output for each of these steps until we explicitly ask for it. How it works in TensorFlow is that you design the entire flow

of your algorithm or your program, and then

you run only this last component, which is the ReLU. What will happen is that when you run just this ReLU operation,

it will go back and automatically calculate all

the earlier steps in the back end. You can see the intermediate outputs

whenever you want, but that’s how it works: you don’t have

to run each step individually. You just run the last part, and it goes back and runs

the intermediate or earlier sections of your functions. As I said, there are two parts
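
The define-first, run-later idea can be illustrated without TensorFlow at all. The sketch below is a toy deferred-computation graph in plain Python; the `Node` class and helper functions are invented for this illustration and are not how TensorFlow is implemented internally. Building the graph computes nothing; calling `run` on the final ReLU node walks back through MatMul and the bias add.

```python
class Node:
    """A deferred computation: stores a function and its input nodes, computes nothing yet."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, log=None):
        # Evaluate the inputs first, then this node.
        values = [node.run(log) for node in self.inputs]
        if log is not None:
            log.append(self.fn.__name__)
        return self.fn(*values)

def constant(value):
    def const():
        return value
    return Node(const)

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(m, bias):
    return [[v + b for v, b in zip(row, bias)] for row in m]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

# Building the graph: nothing is calculated here.
x = constant([[1.0, 2.0]])                # one sample, two input features
w = constant([[1.0, -1.0], [0.5, -2.0]])  # 2x2 weights
b = constant([0.5, 0.5])                  # bias
out = Node(relu, Node(add_bias, Node(matmul, x, w), b))

# Running only the last node triggers every step it depends on.
steps = []
result = out.run(steps)
print(result)
print(steps)
```

The `steps` list records the order of evaluation: the inputs and the matrix multiplication run first, then the bias add, and the ReLU runs last even though it was the only node explicitly asked for.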

within TensorFlow, which we have already seen:

the tensor and the flow. Within flow, things are done in two steps. First, you have

to define the graph, so you will be defining what functions and what calculations

you require in the program, and then you have

to explicitly run it. Once you have installed

TensorFlow on your system, importing the TensorFlow

library is as simple as what we have done

for libraries like pandas or NumPy. It remains absolutely

the same: you write import tensorflow as tf, and tf is just a name; you

can give any name instead of tf. I’m running this code on Python 2.7; if somebody is still

working on 2.7, it should be fine. There are three types of data objects, or

data types, in TensorFlow, and

everything you write in TensorFlow from now on

would be one of these types. So you will be explicitly

telling the program that

whatever line of code and whatever assignment

you are writing is for one of these types. There are three basic types, and these would

be used extensively across your programs. The first one is a constant,

and a constant in TensorFlow is absolutely the same as whatever our understanding

has been of a constant: a constant is a value which doesn’t change.

Whatever value is assigned to a constant remains the same. For example, if I say a equals 5

and I treat that as a constant, then the value of a will always remain five;

it doesn’t change at all. The same convention holds

in TensorFlow: if I have specified

a data object as a constant, its value

will be the same across the program and will not change. I can define a constant

as a string, a float or an integer; it can be any of these types. So let’s see a hello world

program in TensorFlow and how you can run

it. For hello world, I’m assigning hello

as tf.constant. It is tf because I

have imported tensorflow as tf. Whatever name you assign it,

you would be calling all the functions with

that same abbreviation. So here I am saying tf.constant, and I’m giving the value

as Hello World. Now, it has been the case in all

our Python programs until now that once I have assigned

a value to some object, if I evaluate it, it should give me the value. But here you see it’s

not giving you that output. My expectation was that it

should give me Hello World, but it’s giving me a tf.Tensor

with the name Const, a shape, and a dtype of string, not the actual value which I wanted.

Why is it not giving the value? Because there

are two components to a TensorFlow program: one is that you design

or set up the graph, and the second is running it. In TensorFlow, everything has

to be run within a session. The session is

the running part. So we have to explicitly

tell the program to run this command or this object

within a TensorFlow session. This is one of the ways of writing

a TensorFlow session: with tf.Session()

as sess, where the session is abbreviated as sess, and then you

use the command sess.run

and say what you want to run. So if you see,

I’m getting an output by calling sess.run on

this hello constant, and then I’m asking

it to print the output. If you do it now, it gives me the output

which I expected earlier. So this convention

would remain the same across all the programs from now on for TensorFlow: whenever you have done the assignment part, you have to run the function, or any line of code, within a TensorFlow session only, and this is one of the ways of doing it.

There are different types of data: float32 is for real values, and there are int types, such as int32 and int64. The 32 or 64 is the bit size of your data, and people who come from programming would know it as how many bits are taken to define your number in the back end. From that perspective, there are two or three types which we will be working with extensively. One is float, and most of the time we will be using float32. The other one is int, which we will be using for specifying integers.

For constants given with a decimal point, float32 is the default type of data; if we want to specify a constant as an int, or any other type, we have to explicitly mention it. Let’s say node1 is tf.constant: I’m specifying what its value is, and I’m specifying that it’s a float type. And if I have just given the decimal here, it is easy for TensorFlow to understand that it’s a float32 kind of data. And as in the previous command, we saw that when we

run this code, we don’t get the output. For node1 we were expecting three, and for node2 we were expecting four, but we get only the information that each one is just a tensor, and what type of data each constant has.

We ran the session in the previous example in one of the ways, where we said with tf.Session as some name and then ran it. We can define it in another way. We can say sess equals tf.Session(), and then I can run whatever command I want. So here I can print sess.run of node1 and node2, and then call sess.close().
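A minimal sketch of the two session styles described here. The names hello, node1 and node2 follow the walkthrough, but the script itself is an assumption, not the code shown on screen. It also assumes a TensorFlow install that ships the compat.v1 module, which keeps the session-style API available on newer versions.

```python
# Assumption: TensorFlow 1.14+ or 2.x, where the compat.v1 module exists.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # keep graph-and-session behaviour on TensorFlow 2.x

hello = tf.constant("Hello World")
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0)  # a plain decimal is stored as float32

print(hello)  # describes the tensor; it does not print the text itself

# Way 1: the with-block closes the session automatically.
with tf.Session() as sess:
    output = sess.run(hello)
    print(output)

# Way 2: an explicitly created session must be closed by hand.
sess = tf.Session()
values = sess.run([node1, node2])
print(values)
sess.close()
```

On Python 3 the string comes back from sess.run as a bytes object, so it may print with a leading b.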

This one is very important: if we are specifying our session in this particular fashion, where we are saying sess equals tf.Session(), then we have to explicitly say, now close the session. If you don’t close the session, it keeps running, and there may be some issues. Suppose you have node1 multiple times in your code; there may be a possibility that it picks up the wrong value. So if you are specifying your session in this particular way, you need to say sess.close().

And if you are running it the previous way, like we did here using with tf.Session, then whenever the block is done it closes automatically, so you don’t have to worry about whether you have to run this sess.close() command. So there are two ways you can use a TensorFlow session. Here I wanted to show you that there are very

simple calculations which you can do. So here I’m showcasing the flow part. When I presented the first slide on TensorFlow, I was saying there are two components, tensor and flow. Tensors you can define like this, as a constant; that is one of the ways of defining tensors. And then I am doing some kind of mathematical computation.

So let’s say I want to assign c as the multiplication of a and b. This is my formula, or function, which I need to compute. As all of you know, we will be doing a lot of computations in our deep learning programs, but here I am showcasing a very simple one: I need to multiply a with b. It remains the same as it was earlier: defining it doesn’t give me any output.

Now, if I run c, I don’t have to run a and b separately. In the previous example, if I just go back a bit, you see I’m running my node1 and node2, which are nothing but assignments of constants. But what I’m trying to show here

from the flow perspective, which I mentioned in the slide, is this: when you have designed a program something like this, where you specify a and b, and then you also specify the mathematical relationship between a and b and what the resultant value is, all you do is run the final value, or final step, in your overall program. Here c is the final step, so I only run c.

What it does in the back end is that it starts running c and sees, okay, c is the output of a and b. Then it goes one step further back and asks, okay, what are a and b, and runs the program accordingly. So here, if I call only c and run c, it will be calling a and b accordingly and give me the final output, which is 4 times 6, those being the constant values of a and b.

So this is a very small toy example of how the flow works. Had there been more functions after c, say d is c multiplied by some number, you would just need to run d, and all the other functions would run automatically. So the second type

is called a placeholder, and we will be using this kind of data type extensively. I really like this one-line definition, which I have copy-pasted from the TensorFlow website itself. It says a placeholder is a promise to provide a value later. The simplest example is that most of your feature and label values, your X’s and y’s, would be initiated in your TensorFlow program as placeholders, and you can assign any value to them later.

Let’s say you have assigned a equal to tf.placeholder. So we are saying it’s a placeholder type of data, and the type of data is float32, so whatever values I provide in future for a would be floats. For b, too, I’m mentioning that it will be float32. And then I can assign a mathematical operation on them.

One more thing I would like to add is that placeholder objects always come with another thing called feed_dict, or feed dictionary. Whenever we have a placeholder object, we will be giving all the final values with the help of a dictionary. People who know the dictionary type in Python would know that we can assign keys and values in this particular format, which in Python is represented with curly brackets.

What we are actually saying is that there is a dictionary where a is equal to 1, 7 and 6. If you recall, this kind of representation is a list, so a is a list of 1, 7 and 6, and b is a list of 3, 20 and 2. You can think of this as assigning your x equal to 1, 7 and 6 and your y equal to 3, 20 and 2, and this is the same way you will be designing the input as well as the output features of your model. And when you run it, obviously you get this output: 1 plus 3 equals 4, 7 plus 20 equals 27, 6 plus 2 equals 8.

The benefit of this kind of implementation is that your data, on which you are developing

your model, may change its shape and size anytime. For example, taking a very simplistic approach, say that initially when you ran the model you only had three data points, but tomorrow you have got two more. So you have added two more values to your data, and this is the new data set when you run it. Basically, you don’t have to change your code on top. You can just change the assignment in your feed dictionary, and your operation adapts accordingly.

That would be very, very helpful when you are running bigger programs, where your X values and y values may keep changing: the number of columns, the number of rows in your data set, or some kind of problem with your data which you need to fix every time.

Another thing I wanted to show you, as in the previous example: apart from this adder, a plus b, suppose I assign another calculation, say times 5, which takes the output coming from the previous operation and multiplies it by 5. I don’t have to run the intermediate functions; all I need to do is run the last function. And here I’m giving two-dimensional data as the input. If you see, it’s like three rows and three columns of data, and it will still do its job accordingly. That brings us to the third type.
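The placeholder and feed dictionary behaviour described above can be sketched as follows. The values 1, 7, 6 and 3, 20, 2 come from the walkthrough; the names adder and times_five, the two extra data points and the 3 by 3 grids are made up for illustration. As before, this assumes a TensorFlow install that provides the compat.v1 module.

```python
# Assumption: TensorFlow 1.14+ or 2.x, where the compat.v1 module exists.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

a = tf.placeholder(tf.float32)   # a promise to provide values later
b = tf.placeholder(tf.float32)
adder = a + b
times_five = adder * 5           # built on top of the previous operation

with tf.Session() as sess:
    # Three data points, supplied through the feed dictionary.
    sums = sess.run(adder, feed_dict={a: [1, 7, 6], b: [3, 20, 2]})
    # Five data points: only the dictionary changes, not the graph.
    more = sess.run(adder, feed_dict={a: [1, 7, 6, 4, 9], b: [3, 20, 2, 1, 1]})
    # Running only the last step also evaluates adder; 2-D input works too.
    grid = sess.run(times_five, feed_dict={
        a: [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        b: [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    })

print(sums)
print(more)
print(grid)
```

The graph is defined once, and each sess.run call simply receives a differently shaped dictionary.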

The most important type, the third one, comes from TensorFlow itself and is called the TensorFlow Variable. If you see, this is written with a capital V, and sometimes you may make the mistake of writing it with a small v, given that constant and placeholder start with lowercase letters.

Now let’s first discuss and understand what a variable means. The definition, again a one-liner for Variable from the website itself, is that a variable allows us to add trainable parameters to a graph. What it’s basically trying to say is that the tf.Variable type of data is the kind of value to which you can initially assign anything, but during the program you can change the value for your purpose. I will be telling you why we need it, and that’s the important part for the learning process.

From the syntactical point of view, you can assign it just like we did for a placeholder or constant in TensorFlow. Let’s say we are naming it W1, and W1 you can think of as weight one. We are assigning tf.Variable, and we are saying the initial value is 0.5 and it’s of tf.float32 type. This is the way of assignment, but there is one more syntactical line which is required to run your code. Every time you are assigning a variable

type of data in your code, you have to mention this line, which is tf.global_variables_initializer. If you don’t mention this line, your program will not assign the value, which is 0.5, to W1. In fact, let me comment it out and run without it: it gives you a huge error saying init is not defined, and even if you change it to something else, it will still give you a huge error, because for assigning the values you need to run this particular line of code.

When you run it, what it does is initialize the tf.Variable type of data and do the assignment of 0.5 to W1. So this is purely syntactical: however many tf.Variable assignments you have done in your program, at the end of them you just run this line so that the assignment is complete for all the variables, and then you can use them the way you wanted. So this is a very simple intuition. So let’s take

an example from regression. My intent here is to explain to you the concepts behind the cost function and the optimization. Let’s say this is our x and this is y, a very simplistic setup, and this is our actual data. Hopefully you can see my screen; let me change the color. So this is the data we have, and the question is, what is the best regression line which goes through the data set, or makes a good prediction? Because you know there can be any number of lines which pass through it. I can use this line, or it can be this line; there can be an infinite number of lines which go through.

So how does your algorithm decide what the best line for your equation is? Take whatever output came yesterday from scikit-learn. Let’s say it was this; I’m just making up some numbers: the intercept value is 0.2 and the coefficient of x is 0.3, and this is the equation of the line. Let me highlight it once more: let’s say the green one is the line which your model has selected. But you would agree with me that your model has tried different lines and only after that selected this one. So how does

this process happen? This is actually core to all our algorithms, whether machine learning, deep learning or whatever. This process of one thing called cost and a second thing called optimization is common across all your algorithms. If you understand this part, then all you need to do is see how these two things vary for different algorithms; the process itself does not change for any of the algorithms.

So how does it really work? Let me first talk only about cost. Let’s say this green one I’m hovering over is the line, and I need to understand what the cost for it is. What it does is say, okay, this is the predicted value for this point, and this is the actual value. The total cost for this particular line is the sum of squares of all these distances. So whatever the distance was here, and here, I take all these distances, square them, and take the sigma of all of them. The distances are nothing but the actual y, let me call it y_i, minus the predicted y_i hat for the given line. So we have understood how we calculate the cost.
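The cost calculation just described, the sum over all points of (y_i minus y_i hat) squared, can be sketched in plain Python. The function name and the four data points are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def sum_squared_error(xs, ys, intercept, slope):
    """Cost of one candidate line: sum over points of (y_i - y_hat_i) ** 2."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = intercept + slope * x   # predicted value on the line
        total += (y - y_hat) ** 2       # squared distance to the actual value
    return total


# Hypothetical data points, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0]

cost_flat = sum_squared_error(xs, ys, intercept=0.0, slope=0.0)  # the x-axis
cost_good = sum_squared_error(xs, ys, intercept=0.0, slope=1.0)  # passes through every point
print(cost_flat, cost_good)  # 30.0 0.0
```

The line with the smaller cost is the better fit, which is the comparison the model makes between candidate lines.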

And our objective is that, for all the lines we are fitting, all we have to do is compute the cost, and whichever line gives me the least value for this cost, I say this is the final line and give the slope and intercept for that particular line.

So now we have understood how the cost is calculated, but there is one thing still not clear: how does the system identify which line to try, and how does it usually start? As I was saying initially, this is the data; all we know is that these are our points, and we need to find a regression line for them. And just to mention, this process remains the same for both classification and regression, whether multi-class or single class. The way we calculate cost and optimization would be different, but the process remains exactly the same. Let me write down what we were looking at: cost and optimization.

So cost we have understood from this toy example: how we calculate it. But we were in a discussion on tf.Variable, what this tf.Variable is and why it is important, so let me come back to why tf.Variable is important. We know that we have just one input variable, so our equation is going to be y equals the intercept, some coefficient, plus the slope multiplied by x. As of now, we don’t know what the value of the intercept or slope would be. So here is how this process has been designed for any algorithm. I’m mentioning it, I’m giving emphasis

on any algorithm: it will be the same across every one of the algorithms. So what is done initially is this. Let’s say I’m representing the intercept by a, for brevity, and the slope by b. Initially we assign a as a tf.Variable, because I know its value can change, and likewise b as a tf.Variable, and let’s say I initially set it to zero. There are different, more intelligent ways of deciding the initial value, but let’s say I have initially set the values of both tf.Variables to 0.

You will agree with me that if I have a equal to 0 and b equal to 0, then the predicted line from this equation would be my x-axis: because both values are 0, y is always 0, only on this particular line. So once I have initialized these values, I calculate the distance for this line. Basically, the prediction of my model is the bottom line, this x-axis; this is the prediction coming from a equal to 0 and b equal to 0.

Then I calculate the distance from my line for all the points, and I do the calculation for my cost: the actual values are these points, the predicted values are all these 0 values, I take the distance for each point, and I compute the sum of squared errors. Whatever cost I have got is now fixed for this line. In the next iteration, because this is an iterative process, I’ll be changing my values of a and b, and I’ll keep on doing that until I reach a line where the distance between the predicted values and the actual values is minimal. I’ll be talking about how these values are changed.

So I have computed the cost, and let’s say that in this particular instance the sum of squared errors is a hundred: the distance of each point from the line, squared and summed, comes to a hundred. So, next step. What I do is change the value of a and b; because

these are variable types, I can change the value. So in the next iteration, instead of having a equal to 0 and b equal to 0, I assign a equal to its previous value plus 1, and b is again updated by positive one. Now that a’s value is 1 and b’s value is 1, I have changed the values of my a and b to be a bit positive, so the line would probably look something like this: the intercept is now, let’s say, this distance of 1, and each time the x value changes by one, you add one more. So instead of having this line, I end up getting a new line.

This is the process of changing my weights in a direction where the cost is reducing. Let’s say for this particular line, where a is equal to 0 and b equal to 0, the cost was 100, but when I updated the values of a to 1 and b to 1, the cost reduced to, let’s say, 90. What I mean is that once we have updated the values of a and b from 0 to 1, the cost has gone down. This process is

called optimization. So these two terms, cost and optimization, work hand in hand. The process is that you take any initial values of a and b (I am initializing with zeros), then you compute the cost from the differences, and then you update the values in the direction in which the cost reduces.

I’ll be going into the details of how the model understands in which direction the weight needs to be changed, whether it needs to add 1 or subtract 1, or add 2 or add 5. But from the intuition perspective, this process of updating your weights so that your cost is reduced at each iteration is called optimization. We’ll go into that once we have understood the concept of loss. It’s one of the most important concepts, I am saying it again. So let’s have a look.
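To make the cost-and-optimization loop concrete, here is a deliberately crude sketch in plain Python. It starts from a = b = 0, tries nudging a or b up or down by a small step, and keeps a nudge only when the cost goes down. This trial-and-error rule is a stand-in chosen for illustration; it is not the update rule the session explains later, and the data, step size and function names are all assumptions.

```python
def cost(xs, ys, a, b):
    """Sum of squared errors for the line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def crude_optimize(xs, ys, step=0.1, iterations=200):
    """Start from a = b = 0 and keep any small change that lowers the cost."""
    a, b = 0.0, 0.0
    history = [cost(xs, ys, a, b)]
    for _ in range(iterations):
        best = (history[-1], a, b)
        # Try nudging a or b up or down by one step; remember the best move.
        for da, db in ((step, 0), (-step, 0), (0, step), (0, -step)):
            trial = cost(xs, ys, a + da, b + db)
            if trial < best[0]:
                best = (trial, a + da, b + db)
        if best[0] == history[-1]:
            break                      # no nudge helps any more
        _, a, b = best
        history.append(best[0])
    return a, b, history


# Hypothetical data lying on y = 1 + 2x, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b, history = crude_optimize(xs, ys)
print(a, b, history[0], history[-1])
```

The history list plays the role of the 100, then 90, sequence in the explanation: every entry is lower than the one before it, and the loop stops when no small change improves the cost any further.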

We are first showcasing how you can set up a linear equation in the model. Suppose this is the equation, y equals b plus Wx, which we need to design in a model. So we are initializing W and b as Variable types, and the values are 0.3 for W and minus 0.9 for b. Then x is a placeholder, so, like we said, we will be providing the value of x later on. We have defined the linear equation as W times x plus b. This is the line of code, the global variables initializer, to initialize the Variable type objects in the model.

I initialize my session as tf.Session, and then I run init so that the values of the Variable types are initialized. Then I run the linear model with sess.run, with a feed dictionary. Here I’m not writing the name feed_dict. If I go back, I mentioned the dictionary earlier, and your code automatically understands that the values between these curly brackets are of dictionary type.

Now what your model is doing is taking 1, multiplying it by W, which is 0.3, and adding minus 0.9, which is the b value, and this is done for all the values of x. So this is a very crude way of setting up the model.
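The linear model walkthrough can be sketched as below. The values 0.3 and minus 0.9 follow the explanation; the x values 1 to 4 and the name predictions are assumptions for illustration, and the script assumes a TensorFlow install that provides the compat.v1 module.

```python
# Assumption: TensorFlow 1.14+ or 2.x, where the compat.v1 module exists.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

W = tf.Variable(0.3, dtype=tf.float32)    # weight, starts at 0.3
b = tf.Variable(-0.9, dtype=tf.float32)   # intercept, starts at -0.9
x = tf.placeholder(tf.float32)            # values are provided at run time
linear_model = W * x + b

init = tf.global_variables_initializer()  # must be run before W and b hold values

sess = tf.Session()
sess.run(init)
# The dictionary can be passed positionally; it is the feed_dict argument.
predictions = sess.run(linear_model, {x: [1, 2, 3, 4]})
sess.close()

print(predictions)
```

Nothing is trained here: W and b keep their initial values, so the output is just 0.3 times x minus 0.9 for each x, up to float32 rounding.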

But here, if you see, we are not doing any weight optimization. Whatever values we initialized remain the same, and whatever output came, we have taken as it is.

For addition, we can use tf.add; for subtraction, tf.subtract; and for multiplication, tf.multiply. One more important thing is that for matrix multiplication we will be calling a function called matmul, tf.matmul, which stands for matrix multiplication.

It’s not that important, but I’ve put this in to explain that sometimes your data is not defined in the type you want. tf.cast is a function which you can use to convert your data. If you see, this is a float type, because I’m taking 2.0, and if I need to convert it to the integer type, I can use this cast function. I think cast is available in SQL as well. So tf.cast can convert it to the integer type. And this one is for division.

So we have understood some of the basic building blocks of TensorFlow, and these were constant, placeholder and Variable. These activation

functions are a kind of mapping function between input and output. Here the input and output are not the input features and output labels; all I mean by input and output is that if you provide a number to a function, it does some mathematical computation on top of it and returns some value. They are used in numerous ways and numerous places, and some of these, especially sigmoid, tanh, ReLU and softmax, the bottom four, you will be using extensively throughout this course. However, these activation functions are nothing to be scared of. They are just mathematical functions which transform one value into some other value; that’s why they are called transformation functions as well.

So the first one is called linear, or identity. Basically you can think of it as no function being applied, though you will be hearing that a linear activation function is used. What it does is return whatever value you give to the function: if you give 0, it returns 0; if you give 2, it returns 2; if you give n, it returns n. You can think of this kind of line as one where, whatever value you give on x, the same value of y is returned, so it’s also called the identity function.

It is used in the output layer of a neural network where we need to predict some kind of numerical value. Let’s say we are going to solve a regression kind of problem, and we don’t have to give the prediction in terms of probabilities; we just need to predict some real numbers. Then we can use this kind of activation function in the output layer, or you can also think of it as not applying any activation function, as simple as that. But in some texts it’s called a linear activation function, so it’s just good

to know. The second type is called a threshold function, or threshold unit function, or unit step. It checks whether the value of x is beyond some value. Here we have given the example of returning 0 if x is less than 0 and 1 if x is greater than 0, so that’s the threshold we have put in. But if you look at this picture, what we are depicting is that if the value of x, the number you are inputting into the function, is less than or equal to 5, then it returns zero, and if the value is greater than 5, it returns the value 1.

So it’s a kind of unit function, and it’s used at places where you need to set up some kind of threshold: if the value is greater than this, return me this value; if the value is less than that, return me that particular value. But it is not used extensively throughout the program. Now comes the hero.
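The threshold function is small enough to sketch in a few lines of plain Python. The function name and the sample inputs are made up; the threshold of 5 follows the picture described above.

```python
def unit_step(x, threshold=0.0):
    """Return 1 when x is above the threshold, otherwise 0."""
    return 1 if x > threshold else 0


# Threshold 5 follows the picture described above; the inputs are arbitrary.
outputs = [unit_step(x, threshold=5) for x in (2, 5, 5.1, 9)]
print(outputs)  # [0, 0, 1, 1]
```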

The hero of activation functions is the one called the sigmoid, or logistic, activation function. When you give some number to this function, it maps the output value between 0 and 1. So if you give any value between minus infinity and plus infinity, it converts it to a value between 0 and 1, and that is how it is helpful in predicting the probability of an event. If the value we are inputting is positive and bigger, the output from this function will be closer to 1; if we are inputting a negative value, the output will be a smaller value. For example, if x is equal to 1, the output would be 0.73. If it’s 10, it’s very close to 1. You can think of it as just putting some value into the function, and it maps it to a value between zero and one.

The fourth function is called tan hyperbolic, or tanh. For a given input value, it transforms it into a value between minus one and plus one, much like the logistic function converts values to between 0 and 1. It’s very similar in nature to the sigmoid activation function; the only difference is that instead of converting the values to between 0 and 1, it converts them to between minus 1 and plus 1. For an input of 0 its output is zero, whereas for an input of 0 the sigmoid returns 0.5. You can easily understand why: if we input 0 here, then e to the power 0 is 1, and 1 by 1 plus 1, that is 1 by 2, becomes 0.5. So this is the point when the input is 0.

As far as implementation is concerned, sigmoid is used in the output layer so we can predict the probability of an event. Tanh is used, though we haven’t discussed the architecture yet, in a lot of the mathematical computations in your neural network to derive the best values of the weights. Another activation function

is called ReLU, and this has actually transformed the entire learning process of neural networks. It stands for rectified linear unit. It looks very simple: all it’s doing is that for all input values less than or equal to 0 it returns zero, but for all positive values it returns the same value, so it becomes linear for all the positive values. Though it looks very simple, let me tell you, it has changed the entire paradigm of neural networks. This is the activation function used for deriving the most intelligent weights in a neural network.

The softmax activation function is very similar to sigmoid in that it predicts probabilities, but it works very well when you have more than two classes. If you’re working with more than two classes, it converts the values into probability terms. I can see there is a major typo here: it should be 0.01, 0.01, 0.01, because it should be summing up to a total value of 1. What it does is give the probability of an event for each class, and these should sum up to one.

So let’s say there are four categories, A, B, C and D, like here. It will be giving the probability of the event happening in class A, let’s say 0.5; here it’s 0.3, 0.1 and 0.1, and they sum up to 1. Softmax is also used in binary cases, but its major application is when you have more than two classes. The output is very similar to what sigmoid gives, but it’s for more than two classes. So this is a brief introduction to activation functions. So we are importing

the NumPy library as np. You can define your own functions in this particular way; you can define your own custom functions if you need to do your own calculations. def, for define, is the keyword for telling Python that you are writing your own function. Just talking about the syntax, what we are doing is giving the name of our function, then saying that we need to provide an input, and then, once we have the input, what it should return. So return is the value it will give back when you run it. It should give 1 divided by 1 plus np.exp of minus a, which means we are calling the exponential from the NumPy library on minus a. Whatever value you give here is put into this particular formula, and the value is returned.

I’m doing it for these many values of a. Let me show what it’s doing. First, it’s running the activation function for 1: if the value of a is 1, this is the output of your sigmoid activation function. If the value of a is 2, it becomes even bigger. As I was saying, when the number you are giving to a sigmoid activation function is positive and bigger, the output will be closer to 1. If you see, I have given the value 4, and it’s very close to 1: 0.98. When you give 5, it’s even closer.

However, if you give a negative input to this function, the output is very close to 0, and in this particular way your model is able to make predictions between 0 and 1. And if I give the value zero, it returns 0.5. However, there is an alternative to defining these activation functions yourself.
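A small sketch of the sigmoid function described here. The walkthrough uses np.exp from NumPy; this version swaps in math.exp from the standard library so that it runs without any extra install, which is a substitution for illustration rather than the code shown on screen.

```python
import math


def sigmoid(a):
    """Logistic function 1 / (1 + e^(-a)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


for a in (-5, 0, 1, 2, 4, 5):
    print(a, round(sigmoid(a), 4))
```

Large positive inputs land close to 1, large negative inputs land close to 0, and an input of 0 gives exactly 0.5.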

These functions are also available in TensorFlow, where you can just call them rather than defining them yourself. Here I have written the tanh activation function, and here ReLU. ReLU is really very simple: all you have to do is return the maximum of 0 and the number. So if a is greater than 0, it will return a, and if a is less than zero, it will return zero. You see how easy ReLU’s implementation is.

Once you run it, all I’m doing is calling the relu function I have written and asking it to return the value: if I input a equal to minus 4, what would be the ReLU of a? If I do it for minus 4, it returns zero; for minus 2, it returns 0; for positive 2, it returns positive 2; for positive 4, it returns positive 4. In this particular fashion you can write all your functions; I have done the same for softmax and sigmoid. Still, you do not have to define them yourself.
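Hand-written versions of tanh, ReLU and softmax might look like the sketch below. It uses only the standard library; the function bodies and the softmax scores are assumptions for illustration, not the code shown on screen.

```python
import math


def tanh(a):
    """Hyperbolic tangent; maps any real number into (-1, 1)."""
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))


def relu(a):
    """Rectified linear unit: the maximum of 0 and the input."""
    return max(0, a)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


print([relu(a) for a in (-4, -2, 2, 4)])  # [0, 0, 2, 4]
print(tanh(0))                            # 0.0
# The four scores below are made up for illustration.
probs = softmax([2.0, 1.5, 0.5, 0.5])
print(probs, sum(probs))
```

The softmax output has one probability per class, and the largest score receives the largest probability.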

Rather than defining these functions, you can call tf.sigmoid, tf.tanh and tf.nn.relu, where nn stands for neural network in TensorFlow. For ReLU you need to call it this way, through nn, and it will give the output expected from the formula above.

In TensorFlow, or in deep learning architectures generally, it is always a good idea to initialize your weights with some random numbers, and there is a function in TensorFlow called truncated normal. It generates normally distributed numbers with mean 0 and standard deviation 1. The word truncated comes from the fact that all the values generated from this normal distribution will lie only within plus or minus two standard deviations and not beyond. Truncated normal is used so that you don’t get extreme values from the distribution: all the values you generate from this particular distribution are within plus or minus two standard deviations of the 0 mean. Now, a question about this architecture.
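The idea behind truncated normal initialization can be illustrated without TensorFlow. The sketch below is a plain-Python imitation, not TensorFlow’s own implementation: it draws from a normal distribution and re-draws any value that falls more than two standard deviations from the mean. The function name, the seed and the count of 5 weights are assumptions.

```python
import random


def truncated_normal(count, mean=0.0, stddev=1.0, seed=None):
    """Draw normal samples, re-drawing any that fall beyond 2 standard deviations."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:   # keep only values within +/- 2 sigma
            samples.append(value)
    return samples


# Hypothetical use: initial weights for a layer with 5 inputs.
weights = truncated_normal(5, seed=0)
print(weights)
```

Every returned weight is guaranteed to sit between minus 2 and plus 2 for the default mean and standard deviation, so no extreme starting values appear.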

How did this mathematical architecture get the name neural network in the first place? The intuition for this kind of architecture of a learning algorithm comes from biological neurons. People who come from a biology background would already be familiar with this kind of picture, but for others who are not very familiar, either with deep learning or with biological neurons, this is how one looks.

There are three major components which have been used in building the deep learning architectures. The first one is called dendrites, and dendrites are nothing but signal receptors. For example, there are billions of biological neurons in the human brain, and a lot of these biological neurons are attached to our sensory systems. There would be a lot of neurons attached to our eyes, and when we see something, we receive signals in terms of light. What these dendrites do is receive the signal which is coming from the eye, because they’re connected to the retina, or whichever part of the eye is processing the information.

Once these dendrites have the information, it is summed up in the nucleus of this neuron, and that’s called the cell body. So if you take the intuition, what’s happening is that an information signal is being received from these different dendrites, and then this neuron is summing up the total information coming in. Then the final point, the third component of a biological neuron, is called an axon. It passes on the total information which has been received

through this X on to a next neuron, which is connected to it because one neuron

would be connected to a multiple neurons

and they would be lot of neurons when put together and when the information received

from all these signals, which is coming from

their respective cell bodies reaches a particular threshold

your brain take some decisions and this information

is being processed in the form of electric signals

or electric current. This is the same architecture

that McCulloch and Pitts modeled mathematically in the 1940s, and in the late 1950s Rosenblatt came up with the first learning architecture built on it, called the perceptron, which mimics this process of receiving signals, summing them up, and passing the result to the next level. From there we built the complex architectures of deep learning.

If you look at this architecture, it is very similar to a biological neuron. We have some inputs x1, x2, x3, up to xn, and w1, w2, w3, up to wn are the respective weights. With these arrows we are trying to show that we multiply each input by its weight: x1 × w1, x2 × w2, x3 × w3, and so on. Here, in what we called the cell body in the previous picture, we are summing up the information: we calculate the total information coming from our inputs with their respective weights.

Activation functions are nothing but mapping functions. What they do here is check whether this information coming from the inputs with their respective weights is greater than or equal to a threshold. Let's say you are applying a threshold activation function; then your model outputs 1 or 0. For example, say the summed output coming from this information is positive and equal to 10, and you have put in a threshold activation function like this: if the output is greater than or equal to 10 then pass 1, else 0. Your activation function looks at how much the output is and passes on the value accordingly. With the threshold function I have written here: is it greater than or equal to 10? Yes, so the output is 1. This is the process by which a perceptron produces an output.

So here we are just showing that this summation is

nothing but the inputs multiplied by their respective weights: it shows the sum of wi × xi. Before I go further, here is how the bias term is handled. We introduce an extra column, or feature, in our model; let's call it x0, and all the values in x0 are 1. Then we also introduce a weight called w0, which is nothing but what used to be the bias term.

How do we do the actual implementation? Here the sum of wi × xi is shown with i starting from 1. If you just change this to i = 0, the multiplication is easy, because x0 is already 1, so w0 × x0 is just the bias term. In the actual implementation you will see that we sum from i = 0, so we are adding the bias term and, for all the inputs in the model, multiplying by the respective weights. That makes it simpler from the implementation point of view, and we will be calling w0 the bias term.

So we have just introduced another column, x0, whose values are all 1. If I have a feature x0 whose value is 1 for every training example, and I multiply it by w0, which is the bias term, you can easily see that I am just getting the bias term back.

So, given this understanding, how does a perceptron

learn? Let's come to an example of a perceptron. If you look at this and think about it intuitively, this is the actual data: we have three pictures of dogs and three pictures of horses, and we want to create a linear classifier that can separate these two classes.

What your model does is this. As I was saying, initially the values of w0 to wn are initialized randomly, and this line is created from those random weights. Then you look at the predictions: how many images have been predicted accurately and how many inaccurately. That inaccuracy is called the error term. Let's say you have fit this first line from the random weights, and you see that this horse has been predicted as a dog and this dog has been predicted as a horse. That means there are two misclassifications, so the number of errors is two.

In the next iteration your model learns in which direction your coefficients, or weights, should be changed so that this error goes down, and the next line being fit is this one. You have changed the coefficient values, and the first line gets transformed into this line. Then you check again how many errors there are, and this time there is only one, because this dog has been classified as a horse. In the next iteration you improve the weights again so that the error goes down, and then your error has been reduced to zero. This is the line your model should be fitting.

We have seen that logistic regression also fits a line, in the first module. There we didn't go through how logistic regression learns the line; we just called a function from scikit-learn, and that function fit the line for us. Here we will be learning how your algorithm works out what the optimal weights are, and we will go through the different building blocks of code that help your model find the best line, the one giving the least number of errors.

Let's take a

very simple example. All of us should be familiar with AND and OR gates, and you can think of these two gates as a kind of classification problem. If I look at the OR gate, you will see that if any one of the inputs is on, the output is 1, and if both are off, the output is 0. You can also read it this way: if x1 and x2 are both equal to 0 the output is 0, and if either x1 or x2 is equal to 1 the output is 1. The same goes for the AND gate, where the output is 1 only if x1 = 1 and x2 = 1, and in the other three scenarios the output is 0.

Let's see how we can use a simple perceptron to solve this problem. Here we have the inputs and the output, and we want to see how an algorithm works out the line of separation between two classes. There are two classes, y = 0 and y = 1, and we can present them this way. At this point x1 = 0 and x2 = 0, and the output is 0, so we show it as a non-filled circle. The other three points have output 1: at these two points either x1 or x2 is 0, and at this point that I am hovering over, x1 = 1 and x2 = 1. My objective is to find the line that separates these two classes.

Let's see how your model can fit this line. We know what the inputs are, the x1 and x2 values. Let's say the model has already learnt that the weight associated with x1 is w1 = 1 and the weight associated with x2 is w2 = 1, and, just for example purposes, that the bias term b is 0.

Look at the first example. How do we get the output? We take the sum of all the inputs multiplied by their respective weights. For x1 × w1, x1 is 0 and w1 is 1, so this term is 0. The second term, x2 × w2, is again 0 because x2 is 0, so the total output for the first example is 0. Now we have put in a threshold activation function set up this way: if the output is greater than 0.5, give me 1, else 0. Because the output for the first example is 0, and 0 is less than the threshold of 0.5, your model gives an output of 0.
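The weighted sum and threshold just described can be sketched in a few lines of plain Python. This is only an illustration, not the notebook code from the session: the function names `threshold` and `perceptron_output` are made up here, while the weights (1 and 1), the bias (0), and the 0.5 cutoff are the ones from the walkthrough.

```python
def threshold(total, cutoff=0.5):
    """Step activation: 1 if the weighted sum exceeds the cutoff, else 0."""
    return 1 if total > cutoff else 0

def perceptron_output(inputs, weights, bias=0.0, cutoff=0.5):
    """Weighted sum of the inputs plus the bias, passed through the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return threshold(total, cutoff)

weights = [1, 1]                                   # w1 and w2 from the walkthrough
gate_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]     # every (x1, x2) combination

# OR gate: any weighted sum above 0.5 fires.
or_outputs = [perceptron_output(x, weights) for x in gate_inputs]

# AND gate: same weights, but a cutoff of 1.5 (an assumed value, not from the session).
and_outputs = [perceptron_output(x, weights, cutoff=1.5) for x in gate_inputs]

print(or_outputs, and_outputs)
```

Only the point (0, 0) falls below the 0.5 cutoff, which is why one straight line is enough to separate the two OR-gate classes.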

The same process applies to the second example. x1 × w1 is 0 because x1 is 0. For the second term, x2 × w2, this is 1 and this is 1, so the total output is 1. I again pass it through the threshold activation function with the cutoff of 0.5, and since 1 is greater than 0.5, your model outputs 1. So for each one of the examples, your model multiplies the inputs by their respective weights, and with the help of the threshold activation function it decides what output should be given. This way your model is able to classify in which cases it should output 1 and in which cases it should output 0.

Let's look at what the process really looks like when we are trying to draw different lines. If this is the case where there are some blue dots and some green dots, an infinite number of lines can pass between them. We are showing three, but any number of lines can pass through, and your model's objective is to choose the best-fitted line, meaning the one that gives the least error in terms of prediction.

So how does it learn?

There are a few steps involved. We have some inputs, and I am talking about a scenario where we need to predict some kind of class; let's say you need to predict whether something will happen or not. You have your inputs in terms of features x1, x2, up to xn, and you have your dependent variable, or class variable, y.

First, you initialize the weights and the threshold, or whichever activation function you use. You can set all your weights equal to 0, or you can randomly generate numbers for your weights from a distribution with mean 0 and standard deviation 1. Then, given these randomized weights, the inputs x, and the output feature, you make predictions by multiplying the features by their respective weights and applying the activation function. Then you compare how well your model is doing: if the actual value was 1, did your model predict 1 or not? This is the misclassification error you calculate.

This equation may look confusing and intimidating in the beginning, but it is not very difficult. We have calculated the error, which is the difference between the actual and the predicted value. We then update our weights so that this error gets minimized, and this equation is what moves the weights in the direction that reduces the error term at each iteration. Here j just indexes the weight, w1, w2, w3, and so on. The weight at the next iteration is the earlier weight plus eta, which is called the learning rate, multiplied by the error term, which is the desired output minus the output the model actually produced, multiplied by the input: wj(t + 1) = wj(t) + eta × (desired - actual) × xj. Basically, your weights are updated in the direction that minimizes the total error from the model.

The prediction and update steps, steps 2 and 3 here, are repeated until the error has stagnated, or you can repeat the process for a fixed number of iterations; you can say, repeat this learning process for 400 iterations. This is the overall process, and it does not really change from one learning algorithm to another, whether you are talking about a machine learning algorithm, a deep learning algorithm, or an unsupervised one. You initialize some weights, you have your inputs, you measure how well your model is doing in terms of errors, and with the help of the error you keep improving the weights so that the error is minimized.
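The initialize, predict, and update loop can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions rather than the session's own code: the weights start at zero (one of the two options mentioned), the learning rate is 0.1, the step activation fires when the weighted sum is above 0, and the bias is handled as a weight on a constant input x0 = 1. The names `predict` and `train_perceptron` are made up for this sketch.

```python
def predict(weights, features):
    """Step activation on the weighted sum; features[0] is the constant 1 for the bias."""
    total = sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train_perceptron(examples, learning_rate=0.1, max_epochs=20):
    """Perceptron rule: w_j <- w_j + eta * (desired - actual) * x_j."""
    weights = [0.0] * len(examples[0][0])        # step 1: initialise (all zeros here)
    for _ in range(max_epochs):
        errors = 0
        for features, desired in examples:
            actual = predict(weights, features)  # step 2: make a prediction
            error = desired - actual
            if error != 0:
                errors += 1
                # step 3: move each weight in the direction that reduces the error
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
        if errors == 0:                          # no misclassifications left: stop early
            break
    return weights

# OR gate, with x0 = 1 prepended so that weights[0] plays the role of the bias.
or_examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
learned = train_perceptron(or_examples)
predictions = [predict(learned, features) for features, _ in or_examples]
print(learned, predictions)
```

Because the OR gate is linearly separable, the loop stops after a handful of passes once every example is classified correctly; `max_epochs` is the "fixed number of iterations" fallback.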

This process, as I said, is common across all your machine learning and deep learning algorithms; as I see it, it is the process all algorithms follow in terms of learning.

Let me introduce two new terms. The first is the learning rate. The learning rate is the size of the step by which your weights change, and let me give you an intuitive understanding of what that really means. The weight w in the next iteration is changed according to this equation. Apart from eta, the rest of the update term is some value you get from the process; let's say it is a constant, 10. If I change the value of eta, the speed at which w1 changes at each iteration will be different. If I keep eta at 0.1, then in the next iteration my weight is adjusted by 0.1 multiplied by 10, which equals 1, so the new weight is whatever the weight was plus 1. However, eta is a value you provide to the model, and if I fix eta at 0.01, then with the same constant of 10 the product is 0.1, so in the next iteration w1 changes by only 0.1.

So eta is the step size, the step your weight takes when changing. If you keep eta high, your weights change by larger amounts, more aggressively. If you keep it low, your weights take smaller steps when updating, though in both cases they are going in the right direction. When we keep eta at 0.1, the weight changes by 1 at every step; if I keep it very small, the weights update in smaller steps. Eta is used to control how fast or slow your model learns.

Another thing is epochs. Till now we have been

discussing this word, iteration. When I say I am running it a thousand times, I am saying that I am running a thousand iterations. The weight adjustment happens through learning from all the training examples. Let's say you have a hundred training examples in your data; the errors are calculated from your initialized weights on all hundred examples. As we saw in logistic regression, the accuracy percentage is again one of the ways of seeing how well your model is doing. So the learning process takes all hundred examples and goes through all of them to change your weights in the right direction.

Epochs means how many times you go through all hundred examples in the learning process. You decide how many iterations your optimization algorithm should run for. Let's say you want to run it 1,000 times: your model goes through the training examples 1,000 times, changing the weights, and after 1,000 iterations you see what the final weights are. Each such pass over the training data is called an epoch.

So this is the IPython notebook

and let me zoom in a bit if I can. The first thing we need to learn here is the loss function. Intuitively, the loss function is an indication of how well your model is doing. If there is a lot of loss, the model is not doing well. If it is a hundred percent accurate, there is no difference between the actual value and the predicted value. This is the definition that comes from the TensorFlow website: a loss function measures how far apart the current model is from the provided data.

There are a few metrics used to measure how much the loss is, and the choice depends on the kind of problem you are going to solve. For example, if you are working on a regression problem, where you are outputting some kind of numerical value, mean squared error (MSE) is the loss function that is mostly used. You can also use MSE for classification problems: you take 1 or 0 as your output value and the probabilities, say from a sigmoid, which lie between 0 and 1, and you take the difference between the actual value and the predicted value. That is possible, and you can use it, but there is another loss function called cross entropy. Entropy is a loss of information, if you talk in signal terms, and cross entropy is a loss function that is used extensively for most classification problems.

So let's take

a toy example here, where we are trying to see what the loss is. We are importing tensorflow as tf. We are initializing two variables, and for now we are just fixing their values at 0.3 and -0.3 for W and our bias term b. We are initializing x and y, and we are writing an equation in which the output is the weight multiplied by the input feature plus the bias term; this is the linear model. The squared deltas I am calculating here are based on the difference between the predicted value, which is the linear model, and the actual value, which is y, and tf.square takes the square of each difference.

Why are we squaring? There are two reasons. First, we will see at the time of optimization that we take derivatives, and taking the derivative of a squared term is much simpler than taking it of an absolute value. Second, the difference between actual and predicted values may be negative for some examples and positive for others, and if I sum these raw differences over all the examples they may come to 0. You might then think your model is doing a great job, which it is not. So the loss you take is the squared sum: we square each individual difference, and reduce_sum takes the summation over all the examples.

This line, if you recall, is for initializing all the variables in the model, so the global variables initializer sets the values of W and b. Then you run a session, and you need to provide the data with the help of the feed dictionary: I am saying x is equal to 1, 2, 3, 4 and y is equal to 0, -1, -2, -3, and then you run. What is your model doing? It produces the output from the input x and the weight and bias term I initialized at the top, 0.3 and -0.3. It multiplies the weight by all these x values, adds the bias term, and then takes the squared difference from y for all four examples we had.
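The same arithmetic can be checked by hand in plain Python. This is a sketch of the calculation only, not the TensorFlow graph from the notebook, and it assumes the fed values are x = 1, 2, 3, 4 and y = 0, -1, -2, -3; the variable names are chosen here for illustration.

```python
W, b = 0.3, -0.3          # fixed weight and bias from the toy example
xs = [1, 2, 3, 4]         # inputs fed in for x
ys = [0, -1, -2, -3]      # targets fed in for y (assumed reading of the spoken values)

predictions = [W * x + b for x in xs]                              # the linear model
squared_deltas = [(p - y) ** 2 for p, y in zip(predictions, ys)]   # per-example squared difference
loss = sum(squared_deltas)                                         # the role reduce_sum plays

raw_sum = sum(p - y for p, y in zip(predictions, ys))   # unsquared differences, for comparison
print(round(loss, 2), round(raw_sum, 2))
```

Comparing `loss` with `raw_sum` shows why the differences are squared before summing: raw differences of opposite sign could cancel each other out, while squared ones cannot.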

This is the total loss we are getting from this model. So the loss function is an indication of how well or badly your model is doing. Now comes the process of optimization, and this is the core of learning.

One way to calculate the difference between actual and predicted, if the variable you are trying to predict is numerical, as in regression, is the squared difference, and it is useful in telling you how well the model is doing. But when we have categorical values we use a different loss function, and most of the time we use cross entropy. Let's take cross entropy first.

I have created a very small, simple example just to give you an intuition of how your model works out how much the cost is using cross entropy. Let's say you have two features (they have the same values here, and I am just calling them x and x1), there are three classes in your output variable, class 1, 2, and 3, and you have data for seven customers. Because I know there are three distinct classes in my dependent variable, I can represent it with one-hot encoding: class 1 as 1 0 0, class 2 as 0 1 0, and class 3 as 0 0 1. This repeats, so class 1 is again 1 0 0, and so on for all the examples we have.

Now, for this section, assume we have got some output. If we have two features, x and x1, there are two weights associated with them, w and w1, one for each variable, and the output you get comes from w multiplied by x and w1 multiplied by x1. Given these outputs and the one-hot encoded representation: when I was explaining the activation functions, I mentioned that softmax is an activation function that converts the output for each class into a very nice probability term, in the same way as the sigmoid

activation function. Take this example, which is highlighted here. For customer one, these are the total scores that came in for the three classes. The softmax function takes the exponential of each value and divides it by the sum of all the exponential values. So I have taken the exponential of each individual value, the exponential of 3, the exponential of 5, and the exponential of 2, and then converted them into probabilities: this 20, the exponential of 3, is divided by the sum of all three exponentials. Now the probabilities for the three classes sum to 1.

Basically, what softmax tells you is that your model is predicting whichever class has the highest probability. In the example I am highlighting, the actual value was class 1. However, our model predicts the highest probability for class 2, so the model output would actually be class 2.

But our actual purpose in discussing this example was the loss function, and the function is very simple: it is the sum of y multiplied by the log of p, where p is the probability for that particular class. If you look at this function closely, we only need to do the multiplication where y is equal to 1; everything else automatically becomes 0 because y is 0. Let

me take a step back. We represented the first example as class 1, and its three label values are 1 0 0. If you look here, we get a value only for the first class, and it comes to 0.94; for the rest it is 0, because class 2 and class 3 are already represented by 0. So cross entropy is also a very efficient mathematical operation for identifying the cost, since it only measures where y is equal to 1.

In intuitive terms, what we are asking is this: you were expecting 1 and your model predicted 0.11, so what is the total cost? If the output for the particular class is close to 1, then, since you know the log of 1 is 0, the loss for that example will be 0. Here, however, we expected 1, which is the actual value, and the model is giving 0.11, so this is the loss. The function in the end is nothing but, with a negative sign in front, the label multiplied by the log of q, where q is the probability coming from the softmax function.

The second example was class 2; for class 2 the probability is 0.24, and this is the loss we are getting. Let me show one where the loss is very close to 0. Look at this one: it was a class 2 example, and the probability is very, very close to 1, almost 1, with all the other values negligible, so the loss becomes 0. So what cross entropy does is this: if your model predicts close to 1 for the correct class of an example, the loss is very small, but if you predict a lower probability, the loss is higher. At the end of the day, to calculate the loss over the entire training data set, all you do is sum up the values of these cross-entropy terms.

There are other ways of calculating the cost, or loss function, as well. However, MSE for numerical outputs and cross entropy for classification outputs are what is used as the loss function most of the time.

The optimization process is that you update the weights associated with the input values in such a way that your loss keeps getting reduced. You will have seen this kind of explanation from people like Andrew Ng and Geoffrey Hinton, and if you read a machine learning book, it will take a similar intuition. What we have here we created for this tutorial only. Think of it as an example in which somebody is on top of a hill and wants to come down to the ground. He is blindfolded, and the only thing he has is a stick, like blind people have. His only objective is to come down as quickly as possible. So how can he decide which is the right way to go down?

Let's say there are four ways he can go down: he can take this path, this path, or this path, and there are other ways on the other side of the mountain. What this person can do is tap around with his stick and feel whether the ground goes downwards or is flat in a particular direction. We have intentionally made the slopes different: this one is steeper than this slope, which I am hovering over. If he taps around in four directions, he can tell which one is steepest, and if he takes the steeper path he will reach the ground more quickly.

The same intuition is used by

your learning algorithm. Just as this person wants to reach ground level as soon as possible, your model's objective is to reduce the loss, or cost, as quickly as possible, and if you change the weights in the best possible direction you will reduce the loss more quickly.

Taking this intuition further, let's say he taps around and sees that this is the direction that will take him downwards most quickly. The next question is how far he can go: whether he takes a two-metre step, a one-metre step, or just a half-metre step. That is controlled by the learning rate we saw earlier; how big a step he takes is decided by the learning rate in the algorithm. Let's say he can take this bigger step in one go. He has seen that this is the steeper direction, so in one step he reaches here. Then he taps around again, sees which direction is steepest, and takes one more step downwards. He taps around again, sees which direction he needs to move in, and takes another step.

When I say he taps around and sees which direction to move in, in the algorithm that is decided by something called the gradient. The gradient is the first-order derivative, which tells you in which direction your function should move to minimize the cost. At each step he keeps tapping around, and eventually he reaches the bottom.
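The tap-around-and-step idea can be sketched on a one-dimensional bowl. Everything in this block is hypothetical and chosen only for illustration: the loss (w - 1)^2, the starting point w = 0, the two learning rates, and the function names. The gradient plays the role of tapping around, and the learning rate sets how long each step is.

```python
def loss(w):
    """Hypothetical bowl-shaped loss with its minimum at w = 1."""
    return (w - 1.0) ** 2

def gradient(w):
    """First-order derivative of the loss: which way is downhill, and how steep."""
    return 2.0 * (w - 1.0)

def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the gradient; learning_rate sets the step size."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return history

big_steps = gradient_descent(w=0.0, learning_rate=0.2, steps=25)
small_steps = gradient_descent(w=0.0, learning_rate=0.02, steps=25)
print(round(big_steps[-1], 4), round(small_steps[-1], 4))
```

With the same number of steps, the larger learning rate ends up much closer to the bottom of the bowl; the smaller one is heading the same way but has covered less ground.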

The same intuition is used by your learning algorithm, which is called gradient descent. When we say he reaches the ground, in algorithmic terms we mean that our loss function has been reduced to its minimum and is not reducing any further. The point where your function reaches its minimum value is called the global minimum.

In TensorFlow the function for doing this is called the gradient descent optimizer, and there are a few variants of it. It is the most basic optimization algorithm for helping your model work out in which direction your weights should be updated. We have started W at 0.5 and b at 0.1; you can think of these as initial values you have randomly assigned, 0.5 to W and 0.1 to b. We have created two placeholders for our input and output features, and we have created a function for calculating the output for a given example. The loss function is the same one we used in the last explanation: we take the difference between the actual value and the predicted value, square it, and then calculate the total loss by taking the sum.

Now, in TensorFlow, to minimize the loss and update the weights in the direction that reduces it, I call an optimization function called gradient descent, and I can easily call it with tf.train.GradientDescentOptimizer. So this is the way you call

the gradient descent optimizer. This particular value here is the learning rate, which sets how big a step your model takes when updating. Whatever value you start from, how much it changes, the step size, is decided by this.

Then you need to train your algorithm, and for that you call this optimization function. It can be done in one line, but I have put it in two lines; you could have called minimize on the loss right here. What we are doing is creating the optimizer, and then stating our intention, which is to minimize the loss function, the sum of squared distances between actual and predicted values. We do that by calling this optimization function, which is gradient descent.

Then we call the function for initializing the variables, so we are initializing 0.5 and 0.1, starting a session, giving the values of the input and output, and I am printing the values of W and b. The most important thing to think about is these two lines: for i in range(3). What we are doing is running this model three times. If I just print i inside this kind of loop, all it does is print three values, so the loop body runs three times. All we are doing is, on every pass, running the train function on the training examples we have.

when it’s done it because it’s out of the loop. What will be proud putting

is whatever the learnt weight of w and B are and if you see that initially

the value of w 0.5 and beaver point one and then it

has been changed to W8 s change to minus Point 5 4 and B is value has been changed

to minus point two four and if we run it
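To make the loop concrete, here is a minimal sketch in plain Python rather than TensorFlow, so it runs without any library. The toy data, the learning rate of 0.01, and the helper names `loss` and `gradients` are assumptions for illustration; only the starting values 0.5 and 0.1 and the three-step loop come from the session, so the numbers it prints will not match the ones on screen.

```python
# Toy data (an assumption for illustration): targets follow y = -x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, -1.0, -2.0, -3.0]


def loss(w, b):
    """Sum of squared differences between predicted and actual values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def gradients(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return dw, db


w, b = 0.5, 0.1        # starting values mentioned in the session
learning_rate = 0.01   # assumed step size
history = [loss(w, b)]

for i in range(3):     # three training steps
    dw, db = gradients(w, b)
    w -= learning_rate * dw
    b -= learning_rate * db
    history.append(loss(w, b))

# Printed after the loop, so only the final learned values appear.
print(w, b)
print(history)
```

Running the loop for more steps keeps moving `w` and `b` in the direction that lowers the loss.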

for more iterations, they will change further. So look at the loss: when the weight was 0.1, this was the loss; when it was 0.2, this was the loss; and so on, and at w equal to 1 your loss was zero. But if you keep increasing the weight beyond that, your loss keeps increasing again, because now your predictions are off in the opposite direction. How your model sees it is

actually something like this. Initially, that was your total loss. What gradient descent does is take the derivative at this particular point and see which direction it should move. It says, okay, this is the direction you should move, so you need to increase the weight. And increase it by how much? One part is the gradient value, which comes from the derivative, and you multiply that by the learning rate. Let’s say you have fixed the learning rate at 0.2, or whatever number you have fixed; this sets the step size you keep taking. So from here you reach this particular value, and at this point you again take the gradient and see which direction your weight should move, and the learning rate tells you how big the step should be. Then it moves here, and at each step you keep calculating the loss, or the cost. When it reaches there, your gradient becomes horizontal, your model stops changing the weight, and you can say the model has converged. Why have I shown you both sides? Because it is perfectly possible that, rather than initializing

your weights at 0.1, you could have started with a weight equal to 2, and then your model follows the same process: it measures the gradient, sees which direction it needs to move, and keeps going in the direction where the loss goes down. In textbooks you usually won’t see both directions; most of the time you get to see half of the picture. This is roughly what your loss function looks like when you actually implement a model: you start from some place, the gradient descent algorithm tells you which direction to move, the learning rate decides the step size, and you keep going in that direction until your model has identified the weights where the loss has become almost negligible, with the curve going something like this. The same process applies on either side, because, as I was saying, you are supposed to initialize the weights randomly, and it is perfectly possible that rather than starting the weights from 0.1 you start from a weight of 2 or 3. In any of those situations the learning process remains exactly the same. That was the reason I wanted to show you both sides. This is what happens in the back end. So initially we started
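A small sketch of that idea, in plain Python: gradient descent on a one-weight convex loss reaches the same minimum whether it starts to the left or to the right of it. The loss `(w - 1)**2`, the learning rate of 0.2, and the function names are assumptions chosen to mirror the picture described here, not the code from the session.

```python
def loss(w):
    # Toy convex loss with its minimum at w = 1 (assumed for illustration).
    return (w - 1.0) ** 2


def gradient(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 1.0)


def descend(w, learning_rate=0.2, steps=50):
    """Repeatedly apply w <- w - learning_rate * gradient(w)."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


left = descend(0.1)   # starts left of the minimum, so the weight increases
right = descend(2.0)  # starts right of the minimum, so the weight decreases
print(left, right, loss(left), loss(right))
```

Once the gradient is close to zero the updates become negligible, which is the convergence described in the session.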

with weights of 0.5 and 0.1, and we created a loss function as the sum of squared differences. Then we called gradient descent, and what gradient descent is actually doing is changing the weights from the values we initially gave to our model, in the direction in which the loss goes down further and further. So this is the way your model learns. The loss function and the optimization function are the two most important building blocks, and finding the minimum value of a function is one of the most important ideas in machine learning and deep learning. If you look at the function I have here, I can start my values from anywhere and move to any value. This would be my type

of function, and people who come from a mathematics background will know that this is a convex function. Because it is a squared difference between values, we always get this kind of function. You can change W in any direction and you will still end up with this kind of graph; there is no way you will get any other shape. People who know quadratic equations will recognize that this kind of function has only one minimum value: you can start anywhere between plus infinity and minus infinity and keep changing the weights, and there will be only one minimum value in this kind of function. So convex functions have just one minimum value, and since there is only one, it is called the global minimum, because it is the minimum value of the function itself. If your loss function or cost function is quadratic in nature, there can be only one such value. Once you reach it, you can run your model for thousands of iterations and the weights will not change, because the gradient has become parallel to the axis, that is, zero, and there is no change in your weights. But actually this

is not always true, because you will be dealing with different kinds of cost functions. I have used the squared one here, but let’s say you are using a different one, such as the absolute difference, or a more complex model. Then there is a possibility that you end up getting this kind of graph for your cost function. When we say local minima, take this particular point which I’m hovering over now. Let’s say you have reached this place while learning, and you take the derivative, or the gradient, here: it will again be parallel to the axis. So your model may think that it has reached the minimum value possible for your cost, and it stops updating your weights. All these points here are called local minima. In this particular function, this one is actually the global minimum, because it is the lowest value possible for this equation. So if you are using a squared-error cost function, as here, you don’t have to worry about local minima, but you should be aware that with different kinds of loss functions you may have local minima in your cost function. There are different ways of dealing with that: you can initialize the weights as randomly as possible, so there is less probability of getting stuck at a specific place, or you can change the type of loss function you are using.
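The effect of local minima can be seen in a short plain-Python sketch. The cost function `w**4 - 3*w**2 + w` is a made-up non-convex curve, not one from the session, and the learning rate and starting points are likewise assumptions; the point is only that the same gradient descent settles in different places depending on where it starts.

```python
def cost(w):
    # Made-up non-convex cost with two valleys (an assumption for illustration).
    return w ** 4 - 3.0 * w ** 2 + w


def gradient(w):
    # Derivative of the cost with respect to w.
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


from_left = descend(-2.0)   # ends in the deeper valley (the global minimum)
from_right = descend(2.0)   # ends in the shallower valley (a local minimum)
print(from_left, cost(from_left))
print(from_right, cost(from_right))
```

In both runs the gradient ends up close to zero, so the updates stop, even though only one of the two end points is the lowest value of the function.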

I’ll show you an implementation in TensorFlow for one of the use cases we have already implemented with logistic regression, using the same data set we had earlier. We are importing these libraries, and they are the same basic libraries as before; the only new one is the TensorFlow library. Then we are reading the data, the sonar CSV data we had. The values of y looked something like this, and now we have created one-hot encodings: 0 is represented as 1-0 and 1 as 0-1. Given this, next we are coming
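As a quick illustration of the one-hot encoding just described, here is a minimal plain-Python sketch. The helper name `one_hot` and the sample labels are assumptions for illustration.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1      # put a 1 in the position of the class
        encoded.append(row)
    return encoded


# Class 0 becomes [1, 0] and class 1 becomes [0, 1].
print(one_hot([0, 1, 1, 0], 2))
```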

to the normalization process. It is important in machine learning to normalize the features, either to zero mean and unit standard deviation, or to between 0 and 1 with min-max scaling, or with any other normalization; either way, you need to bring all your input features to one scale. This is one of the important steps to take before you implement any deep learning algorithm. If your features are not normalized, your model may take a lot of time to learn and to converge; when I say converge, I mean to identify the most optimal weights. Here the normalization process takes each value, subtracts the mean, and divides by the standard deviation of the data. That normalizes the features, and we have done it for all the features in one go; we will be calling
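The normalization described here, subtracting the mean and dividing by the standard deviation for every feature, can be sketched in plain Python as follows. The function name `normalize_columns` and the sample rows are assumptions, and the sketch uses the population standard deviation and assumes no column is constant.

```python
from statistics import mean, pstdev


def normalize_columns(rows):
    """Scale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # assumes no constant column
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]


# Two features on very different scales (made-up values).
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = normalize_columns(data)
print(normalized)
```

After this step both columns hold the same values, which is what bringing all features to one scale means.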

this function to do this task. Next is appending the bias term. As I was saying, we introduce a column with all the values equal to 1, of the same length as the other features. The purpose of doing this, as I was explaining, is that it makes the implementation quick: we can now multiply w0 with x0, where w0, as I said earlier, is the bias term we previously called b. And because we will be doing matrix multiplication, we can easily compute the summation of w_i times x_i for i from 0 to n, where n is the number of features you have. And this is for plotting
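Here is a minimal sketch of the bias trick in plain Python: a column of ones is placed in front of the features, so the bias becomes the first weight and the output is a single sum over all weights. The helper names and the sample numbers are assumptions for illustration.

```python
def add_bias_column(rows):
    """Prepend x0 = 1 to every example."""
    return [[1.0] + list(row) for row in rows]


def weighted_sum(weights, row):
    """Sum of w_i * x_i for i from 0 to n."""
    return sum(w * x for w, x in zip(weights, row))


features = [[2.0, 3.0], [4.0, 5.0]]
with_bias = add_bias_column(features)

# weights[0] plays the role of the bias b; the rest are feature weights.
weights = [0.5, 1.0, -1.0]
outputs = [weighted_sum(weights, row) for row in with_bias]
print(outputs)
```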

the values we have in the data set, so that we can see how our two classes, 1 and 0, intermingle with each other and whether they are linearly separable. That’s one process you can use to see whether you should be able to use logistic regression for a given problem or need a more complex algorithm. If you see, the values shown as red dots and as crosses are intermingled, and there is no single line which can separate them. So either we need a complex algorithm or, for people who are very much inclined towards using classical machine learning algorithms, you need to do some kind of feature engineering so that the features become linearly separable. But here we are using a neural network, which will help us separate these two classes. That’s how your

x values look like. These are all the features we have, and if you look at the values of our y variable, these are all ones. That’s the reason we need to shuffle our data set: there may be a case like this where all the examples on top are y equal to 1 and all the bottom examples are y equal to 0. Shuffling distributes them evenly so that we don’t have this kind of ordering. Then we are dividing our data into two groups, as we have already seen in regression and logistic regression: we are splitting the data into train and test, 80% for training and 20% for testing. The random state is there so that you get the same split every time. And these are the shapes: 165 is the number of rows and 15 the number
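The shuffle and the 80/20 split can be sketched with the standard library alone. This is not the library call used in the session; the function name `shuffle_and_split`, the seed value, and the toy data are assumptions, with the seed playing the role of the random state.

```python
import random


def shuffle_and_split(features, labels, test_fraction=0.2, seed=42):
    """Shuffle examples with a fixed seed, then split into train and test."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)      # same seed, same order
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [features[i] for i in train_idx]
    test_x = [features[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Ordered toy data: all the 1s first, then all the 0s.
features = [[float(i)] for i in range(10)]
labels = [1] * 5 + [0] * 5
train_x, test_x, train_y, test_y = shuffle_and_split(features, labels)
print(len(train_x), len(test_x))
print(train_y, test_y)
```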

of columns you have in the training set, and this is your training y. Now, if you see, its second dimension is 2, because every y value has been dummy coded: 0 as 1 0 and 1 as 0 1, as I was saying. So first we initialize some of the hyperparameters. As I said, we have introduced just two; as we go through we’ll be learning more hyperparameters, but we have introduced two for the reason that only these two are required here. First, the learning rate. As we have understood by now, it is the step size: the amount of change in your weight at each iteration depends on it. It was something like this: the weight at iteration t plus 1 equals the weight at iteration t minus alpha (or eta, both names are used for the learning rate) multiplied by whatever gradient is coming. So that’s how a weight

gets updated, and your alpha, or learning rate, determines how quickly the weight changes. You might think we should always put the maximum possible value for the learning rate so that the weights are learned as quickly as possible. That’s not bad thinking, but there is a reason we always strive for the optimal value of the learning rate instead. Now let me take

an example to show you. Let’s say here we have taken some specified value of the learning rate, but suppose we started from here and took a very big learning rate. So you have increased the learning rate to a very large number, and starting from here, in one iteration, rather than reaching here you have reached, let’s say, here. In one iteration you have changed the weight from 0.1 to 0.7, and the loss has also reduced drastically: earlier it was around 300, and now it is almost close to 0. But it is not the optimal value, right? There would be some gradient here as well. And if your learning rate is very high, in the next step you will reach here, on the other side, because your learning rate is high. Having reached here, you take the gradient: okay, now you need to reduce the weight, and probably next time you reach here. This process keeps hovering around the minimum. You may reach the minimum point, and then it will stop, but there is also a possibility that you keep oscillating around the optimal value, among non-optimal values, and never move in the direction where your weights finally get optimized. This kind of problem is called

model divergence in textbooks. Some of you may be thinking that it’s not a very common problem, but I can tell you from experience that this is one of the most common problems in making your own models work. There are different ways of deciding what the learning rate should be. As a rule of thumb, you will have seen people using something between 0.1 and 0.01 and variants of it; some people use 0.05, and sometimes it’s like 0.6 or something. So this is the actual range people use for the learning rate, and 0.1 to 0.001 is not a bad selection, actually. We have already seen that if we have a very big learning rate, we end up in this kind of scenario where we never reach the minimum value of our loss function; we keep on oscillating. And we can see whether the learning rate is good or bad, because in the example we will be storing the loss. I’ll also show you why we store the loss of the model at each iteration, or each epoch to be precise: so that you can see whether your model is really learning or is just oscillating in between. The trend of your loss function helps you see whether it’s doing a good job or not. Okay, now I know that a bigger learning rate

may create this kind of problem. Why not choose a very small value? Let’s say I select this as the learning rate. This way, I know that I’ll be taking smaller steps; it will take more time, but eventually I’ll reach the minimum value. But, as I also said, in most real implementations you set how many times your model runs, and let’s say you have set a thousand. Why do we do that? Because when you start doing your calculations on a GPU, you have to be really conscious of how much time your model takes to learn; if it takes too much time, you end up spending a lot of money, which is not worth it. If you have chosen a very small learning rate, your model takes very, very tiny steps. It is going towards the minimum value of your function, but in very short steps, and it takes a lot of time to get there. Let’s say these are your thousand steps and your model has reached only here; you end up using these weights. So these are the two reasons you have to identify the optimal learning rate, and that’s one of the most important hyperparameters for learning the model.
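Both failure modes can be reproduced on a toy problem. The sketch below, in plain Python, runs gradient descent on the assumed loss `(w - 1)**2` with four learning rates; the specific thresholds at which it oscillates or diverges belong to this toy loss only, and the names are illustrative.

```python
def gradient(w):
    # Derivative of the toy loss (w - 1)**2, whose minimum is at w = 1.
    return 2.0 * (w - 1.0)


def run(learning_rate, steps, w=0.1):
    """Return the weight after the given number of update steps."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


good = run(0.2, 1000)         # settles at the minimum
too_big = run(1.0, 1000)      # jumps between 0.1 and 1.9 without settling
diverging = run(1.1, 50)      # every step lands further from the minimum
too_small = run(0.001, 1000)  # still short of the minimum after 1000 steps
print(good, too_big, diverging, too_small)
```

With a fixed budget of a thousand steps, only the first setting ends at weights worth using, which is why the learning rate has to be tuned rather than simply made as large or as small as possible.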

Then this is training epochs: the number of times, the number of iterations, you need to run your algorithm for. So we are saying, run it a thousand times. And here I am setting up an empty array, which will be helpful because, as I said, we will be measuring the cost at each iteration to see whether your model is doing good stuff and learning. How do we know it’s doing good stuff? Your cost will be reducing at each iteration or epoch, which shows that your model is learning the weights in the correct direction. On the number of epochs, I also want to mention

one more thing, similar to the learning rate. If you set too many, let’s say you set this big a number of training epochs, then your model will take too much time. And if you take too few, let’s say 5 or 10, then you will not be able to reach the optimal value, because you will have reached only, let’s say, this position, and this is not the optimum for your model. So you need to play around with these numbers so that you take optimal steps and reach the minimum value with the minimum number of epochs or iterations. The graph of your cost function is very helpful here: it tells you how well you are doing in terms of both these values. As a rule of thumb, start with a reasonably good learning rate, which is 0.1, and with something like a hundred or two hundred training epochs, not more than that. Only after looking at the cost function graph (I’ll show you how to interpret it) decide whether you need a bigger learning rate or more epochs; you can take a lot of help from the graph of the cost history. So now we are putting in the dimensions

of your input shape. So you’re getting the number of columns: if you see, index 1 is for columns and 0 was for rows. And the number of classes is two, because we have two classes now, 0 and 1, represented as the two categories 1 0 and 0 1 because we have done one-hot encoding. You need to specify the x value, and, as I’m saying, it is very important that you specify the dimensions. We are creating X as a placeholder, like we have already done for the smaller examples, and it is of type float32, which we use for decimal values. This None means that it can have any number of rows. In this particular data set we are training our model on, let’s say, 159 or 200 training examples, but tomorrow you may have 300, and you don’t have to change this dimension. If you fix it, then you have to specify how many rows there are, but if you keep it None, then your algorithm understands that it is a changeable number and accepts whatever number of rows it gets. With the number of dimensions you are telling it how many input features go into the model. Here we are deviating a little: above we specified a column for our biases, but we are doing the simple implementation as earlier. The more important point to take away here is this: we are initializing with all zeros, and let

me tell you, it is not a good implementation. But here we are initializing all the initial weights at zero. And of what dimension should they be? We need to do matrix multiplication, so these two matrices have to be of specific dimensions. If I need to do matrix multiplication and, let’s say, this one is n by m, what should the dimension of this other matrix be? It needs m rows, and then I may have any number of columns here. Only if the number of columns of the first matrix equals the number of rows of the second matrix can you do matrix multiplication. That’s why you have to make sure the weights you are initializing are of the correct dimensions, and that’s where most people make mistakes: they interchange the numbers. If you see, this is the number of columns of the input, and here it is the number of rows of the weights. If you make a mistake here, you will be getting a lot of errors. Then I’m initializing
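The dimension rule can be checked with a few lines of plain Python. This is only a sketch with nested lists standing in for tensors; the function `matmul` and the variable names `n_dim` and `n_class` are assumptions for illustration, not the names from the session’s code.

```python
def matmul(a, b):
    """Multiply an n x m matrix by an m x k matrix given as nested lists."""
    n, m = len(a), len(a[0])
    if len(b) != m:
        raise ValueError(
            "columns of the first matrix (%d) must equal rows of the second (%d)"
            % (m, len(b)))
    k = len(b[0])
    return [[sum(a[i][j] * b[j][c] for j in range(m)) for c in range(k)]
            for i in range(n)]


n_dim, n_class = 3, 2
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # 2 examples, n_dim features
w = [[0.0] * n_class for _ in range(n_dim)]   # n_dim rows, n_class columns

out = matmul(x, w)                            # 2 rows, n_class columns
print(out)
```

Interchanging the two numbers, so that the weights have `n_class` rows and `n_dim` columns, makes the multiplication fail, which is the kind of error mentioned here.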

this bias term with all zeros, with dimension equal to the number of classes, so one bias term for each class. We are just using all these variables in the code and defining the y variable, the output, here. We are calling the softmax function, and the predicted value is tf.matmul of X and W plus the bias term, passed through the softmax. Then this is the placeholder for your actual values. For the actual values we are creating the same kind of placeholder, float32, and again None, because, as you see here for the X variable, we are saying that it can be any number of rows, and you would agree that we need as many rows of the dependent variable as of the input features, since each one corresponds to the other. And the second dimension is the number of classes. So we are doing two operations. First, we are calculating the total output, which I was showcasing in the Excel file: the total output for each class. Then I am generating probabilities using the tf.nn.softmax function. So now we are
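What the softmax does to the per-class outputs can be sketched in plain Python. This is a stand-in for illustration, not TensorFlow’s implementation; subtracting the largest score first is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0])   # made-up total outputs for two classes
print(probs)
print(softmax([0.0, 0.0]))    # equal scores give equal probabilities
```

With all weights and biases at zero, every score is zero, so the model starts out giving each of the two classes a probability of one half.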

doing this calculation: we are implementing the cross entropy. There is a built-in function for cross entropy, but here we are trying to showcase the computation. We are multiplying y with the log of the probability we are getting for each class, so you end up getting a number only for the class where y equals 1. Then we take the negative and call reduce_sum with reduction_indices; if you pass 1, it sums by columns. So we are summing it up over all the columns and then taking the mean of all the values. This is the total cost. Then we are calling the training optimizer. As in the earlier example, this could be done in one line of code. We are calling the gradient descent optimizer with the learning rate, which we have already initialized as 0.1, if I’m not wrong, and then what I need to do is minimize the cost function which I have defined here. Then I initialize a session and I initialize
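The cross-entropy computation described here, multiplying y by the log probability, summing across the class columns, negating, and averaging over examples, looks like this in plain Python. The function name and the sample probabilities are assumptions for illustration.

```python
import math


def cross_entropy(y_true, y_prob):
    """Mean over examples of -sum over classes of y * log(p)."""
    per_example = []
    for true_row, prob_row in zip(y_true, y_prob):
        # Only the class where y is 1 contributes to the sum.
        total = sum(t * math.log(p) for t, p in zip(true_row, prob_row))
        per_example.append(-total)
    return sum(per_example) / len(per_example)


y_true = [[1, 0], [0, 1]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]   # what a zero-initialized model predicts
confident = [[0.9, 0.1], [0.2, 0.8]]   # made-up better predictions

print(cross_entropy(y_true, uncertain))
print(cross_entropy(y_true, confident))
```

For two classes predicted at one half each, the cost equals the natural log of 2, roughly 0.69, and it drops as the probability assigned to the correct class rises.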

all the variables in the model. Then the MSE history: this is where we need to store what the loss is at each iteration, and we are starting with an empty list. Then it’s the same loop we have been using, over the number of training epochs, which we have set to a thousand. All we are doing is running this code that many times. You could have used i or anything as the loop variable, but because we call them epochs, I am using for epoch in range of a thousand, so run it a thousand times. And what do you need to run? You need to run the training step, which was the last step in our case. You all know by now that everything in TensorFlow is a graph, and if you run the last calculation, all the other calculations above it will be done accordingly. In the feed dictionary we pass train x and train y, which we have already defined. We can also calculate the cost by feeding the same dictionary, so that we can see what it looks like, and we keep appending this cost at each iteration so that we can see how it’s doing. And here, in this line of code, we are predicting: once training is done, make the prediction for the test examples. When it runs, it should print me three values, including the epoch number and the total cost, and this is actually the more important thing: to see whether your cost is going down or not. So let me go to the top. I have already run this code for you; the numbers may be
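The whole training loop, with the cost stored at every epoch, can be sketched end to end in plain Python. This is a hand-written stand-in for the TensorFlow graph, under stated assumptions: the tiny one-feature data set, 200 epochs, and all variable names are made up for illustration, and the gradient of the mean cross entropy with respect to the scores is taken as `p - y`.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, bias):
    # weights has one row per feature and one column per class.
    scores = [sum(x[j] * weights[j][c] for j in range(len(x))) + bias[c]
              for c in range(len(bias))]
    return softmax(scores)


# Tiny made-up data set: one feature, two one-hot encoded classes.
train_x = [[-2.0], [-1.0], [1.0], [2.0]]
train_y = [[1, 0], [1, 0], [0, 1], [0, 1]]
n_dim, n_class, n_rows = 1, 2, len(train_x)

weights = [[0.0] * n_class for _ in range(n_dim)]   # zero initialization
bias = [0.0] * n_class
learning_rate = 0.1
training_epochs = 200
cost_history = []                                   # cost at every epoch

for epoch in range(training_epochs):
    probs = [predict(x, weights, bias) for x in train_x]
    cost = -sum(math.log(p[y.index(1)])
                for p, y in zip(probs, train_y)) / n_rows
    cost_history.append(cost)
    # Gradient descent step on every bias and weight.
    for c in range(n_class):
        errors = [p[c] - y[c] for p, y in zip(probs, train_y)]
        bias[c] -= learning_rate * sum(errors) / n_rows
        for j in range(n_dim):
            grad = sum(e * x[j] for e, x in zip(errors, train_x)) / n_rows
            weights[j][c] -= learning_rate * grad
    if epoch % 50 == 0:
        print("epoch:", epoch, "cost:", cost)

print("final cost:", cost_history[-1])
```

The list `cost_history` is what gets plotted afterwards: a curve that is still falling at the last epoch suggests more epochs or a larger learning rate, while a curve that has flattened suggests the extra epochs are not buying anything.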

of a different scale, and sometimes you get different numbers, but if the cost is reducing at each iteration, then it’s probably an indication that your model is doing a good job: your weights are being learned in the right direction and your loss is reducing. We started with 0.68, and as I keep running it for a thousand iterations, it keeps reducing. There may be fluctuations, and not every value will be lower than the last, but if you see that overall it has gone down, that’s an indication that your model is learning. We have also created a plot of all the thousand iteration values. What it tells you is, one, your cost is going down, maybe not in a perfectly optimized way, but pretty neatly. I can see that the learning rate we have selected is doing a good job in terms of reducing it, and at the thousandth iteration it is still going down. So had it been

something like this, if your graph looks something like this, then it’s an indication that your model doesn’t require a lot of iterations, because your loss function has tapered off and the model is not learning any further. If your loss function looks like this against the number of epochs, then it’s an indication that your cost is still going down. eval is, again, an evaluation function in TensorFlow. And here we are plotting the same cost again. If you see, in my opinion, it has not completely flattened; it is still reducing a little, not much, but I see there is scope to work with more epochs. One thing you can try on your own when you run this code is to increase the learning rate and reduce the number of epochs, and see what happens; the best shape of your graph looks something like this. Thank you. So thank you for

the great session, Amit. I hope all of you found it informative. If you have any further queries related to the session, please post them in the comment section below. Until then, that’s all from our side today. Thank you and happy learning. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning.